Fence method for nonparametric small area estimation 


Jiming Jiang, Thuan Nguyen and J. Sunil Rao ¹ 


Abstract 


This paper considers the problem of selecting nonparametric models for small area estimation, which have recently received 
much attention. We develop a procedure based on the idea of the fence method (Jiang, Rao, Gu and Nguyen 2008) for selecting 
the mean function for the small areas from a class of approximating splines. Simulation results show impressive 
performance of the new procedure even when the number of small areas is fairly small. The method is applied to a hospital 
graft failure dataset for selecting a nonparametric Fay-Herriot type model. 


Key Words: Fay-Herriot Model; Fence method; Nonparametric model selection; Penalized spline; Small area estimation. 


1. Introduction 


Small area estimation (SAE) has received increasing 
attention in recent literature. Here the term small area 
typically refers to a population for which reliable statistics 
of interest cannot be produced due to certain limitations of 
the available data. Examples of small areas include a 
geographical region (e.g., a state, county, municipality, etc.), 
a demographic group (e.g., a specific age × sex × race 
group), a demographic group within a geographic region, 
etc. In the absence of adequate direct samples from the small 
areas, methods have been developed in order to “borrow 
strength”. Statistical models, especially mixed effects 
models, have played important roles in SAE. See Rao 
(2003) for a comprehensive account of various methods 
used in SAE. 

While there is extensive literature on inference about 
small areas using mixed effects models, including esti- 
mation of small area means which is a problem of mixed 
model prediction, estimation of the mean squared error 
(MSE) of the empirical best linear unbiased predictor 
(EBLUP; see Rao 2003), and prediction intervals (e.g., 
Chatterjee, Lahiri and Li 2007), model selection in SAE has 
received much less attention. However, the importance of 
model selection in SAE has been noted by prominent 
researchers in this field (e.g., Battese, Harter and Fuller 
1988, Ghosh and Rao 1994). Datta and Lahiri (2001) 
discussed a model selection method based on computation 
of the frequentist’s Bayes factor in choosing between a fixed 
effects model and a random effects model. They focused on 
the following one-way balanced random effects model for 
the sake of simplicity: $y_{ij} = \mu + u_i + e_{ij}$, 
$i = 1, \ldots, m$, $j = 1, \ldots, k$, where the $u_i$'s and $e_{ij}$'s are normally 
distributed with mean zero and variances $\sigma_u^2$ and $\sigma_e^2$, 
respectively. As noted by the authors, the choice between a 
fixed effects model and a random effects one in this case is 
equivalent to testing the following one-sided hypothesis 
$H_0: \sigma_u^2 = 0$ vs $H_1: \sigma_u^2 > 0$. Note, however, that not all 
model selection problems can be formulated as hypothesis 
testing. Fabrizi and Lahiri (2004) developed a robust model 
selection method in the context of complex surveys. Meza 
and Lahiri (2005) demonstrated the limitations of Mallows’ 
$C_p$ statistic in selecting the fixed covariates in a nested 
error regression model (Battese, Harter and Fuller 1988), 
defined as $y_{ij} = x_{ij}'\beta + u_i + e_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n_i$, 
where $y_{ij}$ is the observation, $x_{ij}$ is a vector of fixed 
covariates, $\beta$ is a vector of unknown regression coefficients, 
and the $u_i$'s and $e_{ij}$'s are the same as in the model 
above considered by Datta and Lahiri (2001). Simulation 
studies carried out by Meza and Lahiri (2005) showed that 
the $C_p$ method without modification does not work well in 
the current mixed model setting when the variance $\sigma_u^2$ is 
large; on the other hand, a modified $C_p$ criterion developed 
by these latter authors by adjusting the intra-cluster 
correlations performs similarly to $C_p$ in regression 
settings. It should be pointed out that all these studies are 
limited to linear mixed models, while model selection in 
SAE in a generalized linear mixed model (GLMM) setting 
has never been seriously addressed. 

Recently, Jiang et al. (2008) developed a new strategy 
for model selection, called fence methods. The authors noted 
a number of limitations of the traditional model selection 
strategies when applied to mixed model situations. For 
example, the BIC procedure (Schwarz 1978) relies on the 
effective sample size which is unclear in typical situations of 
SAE. To illustrate this, consider the nested error regression 
model introduced above. Clearly, the effective sample size 
is not the total number of observations $n = \sum_{i=1}^{m} n_i$, nor is it 
proportional to $m$, the number of small areas, unless all the 
$n_i$ are equal and fixed. The fence methods avoid such 


1. Jiming Jiang, University of California, Davis. E-mail: jiang@wald.ucdavis.edu; Thuan Nguyen, Oregon Health and Science University; J. Sunil Rao, Case Western Reserve University. 




limitations, and therefore are suitable to mixed model 
selection problems, including linear mixed models and 
GLMMs. The basic idea of fence is to build a statistical 
fence to isolate a subgroup of what are known as the correct 
models. Once the fence is constructed, the optimal model is 
selected from those within the fence according to a criterion 
which can incorporate quantities of practical interest. More 
details about the fence methods are given below. 

The focus of this paper is nonparametric models for 
SAE. These models have received much recent attention. In 
particular, Opsomer, Breidt, Claeskens, Kauermann and 
Ranalli (2007) proposed a spline-based nonparametric 
model for SAE. The idea is to approximate an unknown 
nonparametric small-area mean function by a penalized 
spline (P-spline). The authors then used a connection 
between P-splines and linear mixed models (Wand 2003) to 
formulate the approximating model as a linear mixed model, 
where the coefficients of the splines are treated as random 
effects. Consider, for simplicity, the case of univariate 
covariate. Then, a P-spline can be expressed as 


$$\tilde f(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \gamma_1 (x - \kappa_1)_+^p + \cdots + \gamma_q (x - \kappa_q)_+^p, \qquad (1)$$


where $p$ is the degree of the spline, $q$ is the number of 
knots, $\kappa_j$, $1 \le j \le q$ are the knots, and $x_+ = x 1_{(x > 0)}$. 
Clearly, a P-spline is characterized by $p$, $q$, and also the 
location of the knots. Note, however, that given $p$, $q$, the 
location of the knots can be selected by the space-filling 
algorithm implemented in R [cover.design()]. But the 
question of how to choose $p$ and $q$ remains. The general 
“rule of thumb” is that $p$ is typically between 1 and 3, and 
$q$ is proportional to the sample size, $n$, with 4 or 5 
observations per knot (Ruppert, Wand and Carroll 2003). 
But there may still be a lot of choices given the rule of 
thumb. For example, if $n = 200$, the possible choices for $q$ 
range from 40 to 50, which, combined with the range of 1 to 
3 for $p$, gives a total of 33 choices for the P-spline. Our 
new adaptive fence method offers a data-driven approach 
for choosing $p$ and $q$ for the spline-based SAE model. 
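
As a minimal sketch of how such a candidate set might be built, the Python code below constructs the truncated power basis of (1) and enumerates the $(p, q)$ pairs allowed by the rule of thumb for $n = 200$. The function names are hypothetical, and the knots are placed at sample quantiles purely for illustration; this is an assumption standing in for the space-filling design mentioned above, not a reimplementation of it.

```python
import numpy as np

def quantile_knots(x, q):
    """Place q knots at equally spaced sample quantiles (illustrative assumption)."""
    probs = np.arange(1, q + 1) / (q + 1)
    return np.quantile(x, probs)

def truncated_power_basis(x, p, knots):
    """Return X with columns 1, x, ..., x^p and Z with columns (x - kappa_j)_+^p."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.vander(x, N=p + 1, increasing=True)
    diff = x[:, None] - knots[None, :]
    if p == 0:
        Z = (diff > 0).astype(float)           # convention: (t)_+^0 = 1 if t > 0, else 0
    else:
        Z = np.clip(diff, 0.0, None) ** p
    return X, Z

# Rule-of-thumb candidates for n = 200: p in {1, 2, 3}, q in {40, ..., 50}
n = 200
candidates = [(p, q) for p in (1, 2, 3) for q in range(n // 5, n // 4 + 1)]

# Example basis for a quadratic spline with 5 knots on an equally spaced grid
x = np.linspace(0.0, 1.0, n)
X, Z = truncated_power_basis(x, p=2, knots=quantile_knots(x, 5))
print(len(candidates), X.shape, Z.shape)
```

The list `candidates` reproduces the count of 33 given in the text; each pair would then be turned into a design `(X, Z)` and passed to the model selection procedure.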

The rest of the paper is organized as follows. The fence 
methods are described in section 2. In section 3 we develop 
an adaptive fence procedure for the nonparametric model 
selection problem. In section 4 we demonstrate the finite 
sample performance of the new procedure with a series of 
simulation studies. In section 5 we consider a real-life data 
example involving a dataset from a medical survey which 
has been used for fitting a Fay-Herriot model (Fay and 
Herriot 1979). Some technical results are deferred to the 
appendix. 


2. Fence methods 


As mentioned, the basic idea of fence is to construct a 
statistical fence and then select an optimal model from those 
within the fence according to certain criterion of optimality, 
such as model simplicity. Let $Q_M = Q_M(y, \theta_M)$ be a 
measure of lack-of-fit, where $y$ represents the vector of 
observations, $M$ indicates a candidate model, and $\theta_M$ 
denotes the vector of parameters under $M$. Here by lack-of-fit 
we mean that $Q_M$ satisfies the basic requirement that 
$E(Q_M)$ is minimized when $M$ is a true model, and $\theta_M$ the 
true parameter vector under $M$. Then, a candidate model 
$M$ is in the fence if 


$$\hat Q_M - \hat Q_{\tilde M} \le c_n \hat\sigma_{M, \tilde M}, \qquad (2)$$


where $\hat Q_M = \inf_{\theta_M \in \Theta_M} Q_M$, $\Theta_M$ being the parameter space 
under $M$, $\tilde M$ is a model that minimizes $\hat Q_M$ among 
$M \in \mathcal{M}$, the set of candidate models, and $\hat\sigma_{M, \tilde M}$ is an estimate 
of the standard deviation of $\hat Q_M - \hat Q_{\tilde M}$. The constant 
$c_n$ on the right side of (2) can be chosen as a fixed number 
(e.g., $c_n = 1$) or adaptively (see below). 

The calculation of $\hat Q_M$ is usually straightforward. For 
example, in many cases $Q_M$ can be chosen as the negative 
log-likelihood, or residual sum of squares. On the other 
hand, the computation of $\hat\sigma_{M, \tilde M}$ can be quite challenging. 
Sometimes, even if an expression can be obtained for 
$\hat\sigma_{M, \tilde M}$, its accuracy as an estimate of the standard deviation 
cannot be guaranteed in a finite sample situation. Jiang, 
Nguyen and Rao (2009) simplified an adaptive fence 
procedure proposed by Jiang et al. (2008). For simplicity, 
we assume that $\mathcal{M}$ contains a full model, $M_f$, of which 
each candidate model is a submodel. It follows that 
$\tilde M = M_f$. In the simplified adaptive procedure, the fence 
inequality (2) is replaced by 


$$\hat Q_M - \hat Q_{M_f} \le c_n, \qquad (3)$$


where $c_n$ is chosen adaptively as follows. For each 
$M \in \mathcal{M}$, let $p^*(M) = P^*\{M_0(c) = M\}$ be the empirical 
probability of selection for $M$, where $M_0(c)$ denotes the 
model selected by the fence procedure based on (3) with 
$c_n = c$, and $P^*$ is obtained by bootstrapping under $M_f$. For 
example, under a parametric model one can estimate the 
model parameters under $M_f$ and then use a parametric 
bootstrap to draw samples under $M_f$. Suppose that $B$ 
samples are drawn; then $p^*(M)$ is simply the sample 
proportion (out of a total of $B$ samples) that $M$ is selected 
by the fence procedure based on (3) with the given $c_n$. Let 
$p^* = \max_{M \in \mathcal{M}} p^*(M)$. Note that $p^*$ depends on $c_n$. Let 
$c_n^*$ be the $c_n$ that maximizes $p^*$ and this is our choice. 
Jiang et al. (2008) offer the following explanation of the 
motivation behind adaptive fence. Suppose that there is a 
true model among the candidate models, then, the optimal 
model is the one from which the data is generated, and 
therefore should be the most likely given the data. Thus, 
given $c_n$, one is looking for the model (using the fence 
procedure) that is most supported by the data or, in other 
words, one that has the highest (posterior) probability. The 
latter is estimated by bootstrapping. Note that although the 
bootstrap samples are generated under $M_f$, they are almost 
the same as those generated under the optimal model. This 
is because the estimates corresponding to the zero parameters 
are expected to be close to zero, provided that the parameter 
estimators under $M_f$ are consistent. One then pulls off the 
$c_n$ that maximizes the (posterior) probability and this is the 
optimal choice. 

There are two extreme cases corresponding to $c_n = 0$ 
and $c_n = \infty$ (i.e., very large). Note that if $c_n = 0$, then 
$p^* = 1$. This is because when $c_n = 0$ the procedure always 
chooses $M_f$. Similarly, if there is a unique simplest model 
(e.g., model with minimum dimension), say, $M_*$, then 
$p^* = 1$ for very large $c_n$. This is because, when $c_n$ is large 
enough, all models are in the fence, hence the procedure 
always chooses $M_*$, if simplicity is used as the criterion of 
optimality for selecting the model within the fence. These 
two extreme cases are handled carefully in Jiang et al. 
(2008) and Jiang et al. (2009). However, as noted by Jiang 
et al. (2008), the procedures to handle the extreme cases, 
namely, the screen tests and baseline adjustment/threshold 
checking, are rarely needed in practice. For example, in 
most applications there are a (large) number of candidate 
variables, and it is believed that only a (small) subset of 
them are important. This means that the optimal model is 
neither $M_f$ nor $M_*$. Therefore, there is no need to worry 
about the extreme cases, and the procedures to handle these 
cases can be skipped. In most applications a plot of $p^*$ 
against $c_n$ is W-shaped with the peak in the middle 
corresponding to $c_n^*$. 

The left plot of Figure 2 provides an illustration. This is a 
plot of $p^*$ against $c_n$ for the example discussed in section 
5. The plot shows the typical “W” shape, as described, and 
the peak in the middle corresponds to where the optimal $c_n$, 
i.e., $c_n^*$, is. 
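
The computation of $p^*$ as a function of $c_n$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the full model is taken to be the candidate with the largest dimension, model dimension is used as the simplicity criterion, and the small array `q_boot` holds made-up lack-of-fit values standing in for real bootstrap output.

```python
import numpy as np

def fence_select(q_hat, dims, c):
    """Model chosen by the fence (3): among models with q_hat - q_hat[full] <= c,
    take the smallest dimension (ties broken by the smaller q_hat)."""
    q_hat = np.asarray(q_hat, dtype=float)
    dims = np.asarray(dims)
    full = int(np.argmax(dims))                 # assumption: full model has the largest dimension
    in_fence = np.flatnonzero(q_hat - q_hat[full] <= c)
    # np.lexsort uses the LAST key as the primary key: sort by dims, then by q_hat
    order = np.lexsort((q_hat[in_fence], dims[in_fence]))
    return in_fence[order[0]]

def p_star_curve(q_boot, dims, c_grid):
    """q_boot has one row of lack-of-fit values per bootstrap sample.
    Returns p*(c) on the grid and the index of the model attaining it."""
    q_boot = np.asarray(q_boot, dtype=float)
    n_models = q_boot.shape[1]
    p_star, best = [], []
    for c in c_grid:
        picks = np.array([fence_select(row, dims, c) for row in q_boot])
        freq = np.bincount(picks, minlength=n_models) / len(picks)
        p_star.append(freq.max())
        best.append(int(freq.argmax()))
    return np.array(p_star), np.array(best)

# Made-up lack-of-fit values: 4 bootstrap samples, 3 candidate models
q_boot = np.array([[5.0, 1.0, 0.9],
                   [6.0, 1.2, 1.0],
                   [4.0, 0.8, 0.7],
                   [5.5, 3.0, 1.0]])
dims = np.array([1, 2, 3])
c_grid = np.array([0.0, 0.5, 10.0])
p_star, best = p_star_curve(q_boot, dims, c_grid)
print(p_star, best)
```

In this toy example $p^*$ equals 1 at both ends of the grid, where the full model and the simplest model are always chosen, which is why the peak of interest is the one in the middle of the plot. A practical version would use a finer grid and would locate that interior peak rather than the global maximum.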

Jiang et al. (2009) established consistency of the 
simplified adaptive fence and studied its finite sample 
performance. 


3. Nonparametric SAE model selection 


For the simplicity of illustration we consider the 
following SAE model: 


$$y_i = f(X_i) + B_i u_i + e_i, \quad i = 1, \ldots, m, \qquad (4)$$


where $y_i$ is an $n_i \times 1$ vector representing the observations 
from the $i$th small area; $f(X_i) = [f(x_{ij})]_{1 \le j \le n_i}$ with $f(x)$ 
being an unknown (smooth) function; $B_i$ is an $n_i \times b$ 
known matrix; $u_i$ is a $b \times 1$ vector of small-area specific 
random effects; and $e_i$ is an $n_i \times 1$ vector of sampling 
errors. It is assumed that $u_i$, $e_i$, $i = 1, \ldots, m$ are independent 
with $u_i \sim N(0, G_i)$, $G_i = G_i(\theta)$, and $e_i \sim N(0, R_i)$, $R_i = R_i(\theta)$, 
$\theta$ being an unknown vector of variance components. 
Note that, besides $f(X_i)$, the model is the same as the 
standard “longitudinal” linear mixed model (e.g., Laird and 
Ware 1982, Datta and Lahiri 2000). 

The approximating spline model is given by replacing 
$f(x)$ by $\tilde f(x)$ in (1), where the coefficients $\beta$'s and $\gamma$'s 
are estimated by penalized least squares, i.e., by minimizing 


$$|y - X\beta - Z\gamma|^2 + \lambda |\gamma|^2, \qquad (5)$$


where $y = (y_i)_{1 \le i \le m}$, the $(i, j)$th row of $X$ is $(1, x_{ij}, \ldots, x_{ij}^p)$, 
the $(i, j)$th row of $Z$ is $[(x_{ij} - \kappa_1)_+^p, \ldots, (x_{ij} - \kappa_q)_+^p]$, 
$i = 1, \ldots, m$, $j = 1, \ldots, n_i$, and $\lambda$ is a penalty, or smoothing, 
parameter. To determine $\lambda$, Wand (2003) used the following 
interesting connection to a linear mixed model. To 
illustrate the idea, let us consider a simple case in which 
$B_i = 0$ (i.e., there are no small-area random effects), and the 
components of $e_i$ are independent and distributed as 
$N(0, \tau^2)$. If the $\gamma$'s are treated as random effects which 
are independent and distributed as $N(0, \sigma^2)$, then the 
solution to (5) is the same as the best linear unbiased 
estimator (BLUE) for $\beta$, and the best linear unbiased 
predictor (BLUP) for $\gamma$, if $\lambda$ is identical to the ratio 
$\tau^2/\sigma^2$. Thus, the value of $\lambda$ may be estimated by the 
maximum likelihood (ML), or restricted maximum likelihood 
(REML) estimators of $\sigma^2$ and $\tau^2$ (e.g., Jiang 2007). 
However, there have been studies suggesting that this approach 
is biased towards undersmoothing (Kauermann 2005). 
Consider, for example, a special case in which $f(x)$ is, in 
fact, the quadratic spline with two knots given by (10). 
(Note that this function is smooth in that it has a continuous 
derivative.) It is clear that, in this case, the best approximating 
spline should be $f(x)$ itself with only two knots, 
i.e., $q = 2$ (of course, one could use a spline with many 
knots to “approximate” the two-knot quadratic spline, but 
that would seem very inefficient in this case). However, if 
one uses the above linear mixed model connection, the ML 
(or REML) estimator of $\sigma^2$ is consistent only if $q \to \infty$ 
(i.e., the number of appearances of the spline random effects 
goes to infinity). The seeming inconsistency has two worrisome 
consequences: (i) the meaning of $\lambda$ may be conceptually 
difficult to interpret; (ii) the behavior of the estimator 
of $\lambda$ may be unpredictable. 
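
The connection between (5) and the linear mixed model can be checked numerically. The sketch below, with hypothetical function names and simulated data that are assumptions for illustration only, solves the penalized least squares problem directly and compares the result with the BLUE and BLUP computed from $V = I_n + \lambda^{-1} Z Z'$, to which the covariance matrix of $y$ is proportional when $\lambda = \tau^2/\sigma^2$.

```python
import numpy as np

def penalized_ls(y, X, Z, lam):
    """Minimize |y - X b - Z g|^2 + lam * |g|^2 via ridge-type normal equations."""
    W = np.hstack([X, Z])
    k = X.shape[1]
    pen = np.diag(np.r_[np.zeros(k), np.full(Z.shape[1], lam)])   # penalize only g
    coef = np.linalg.solve(W.T @ W + pen, W.T @ y)
    return coef[:k], coef[k:]

def blue_blup(y, X, Z, lam):
    """BLUE of beta and BLUP of gamma when Var(y) is proportional to V = I + Z Z' / lam."""
    V = np.eye(len(y)) + Z @ Z.T / lam
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
    gamma = Z.T @ np.linalg.solve(V, y - X @ beta) / lam
    return beta, gamma

# Simulated example: quadratic spline basis with knots at 1 and 2 (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, 40)
X = np.vander(x, N=3, increasing=True)
Z = np.clip(x[:, None] - np.array([1.0, 2.0]), 0.0, None) ** 2
y = rng.normal(size=40)

b_pls, g_pls = penalized_ls(y, X, Z, lam=0.5)
b_mm, g_mm = blue_blup(y, X, Z, lam=0.5)
print(np.allclose(b_pls, b_mm), np.allclose(g_pls, g_mm))
```

The two routes agree for any fixed $\lambda > 0$; the difficulty discussed above concerns how $\lambda$ is estimated, not this algebraic equivalence.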

The fence method offers a natural approach to choosing 
the degree of the spline, $p$, the number of knots, $q$, and the 
smoothing parameter, $\lambda$, at the same time. Note, however, a 
major difference from the situations considered in Jiang 
et al. (2008) and Jiang et al. (2009) in that the true 
underlying model is not among the class of candidate 
models, i.e., the approximating splines (1). Furthermore, the 
role of $\lambda$ in the model should be made clear: $\lambda$ controls the 
degree of smoothness of the underlying model. A natural 
measure of lack-of-fit is $Q_M = |y - X\beta - Z\gamma|^2$. However, 
$\hat Q_M$ is not obtained by minimizing $Q_M$ over $\beta$ and $\gamma$ 
without constraint. Instead, we have $\hat Q_M = |y - X\hat\beta - Z\hat\gamma|^2$, 
where $\hat\beta$ and $\hat\gamma$ are the solution to (5), and hence depend 
on $\lambda$. The optimal $\lambda$ is to be selected by the fence method, 
together with $p$ and $q$, as described below. 

Another difference is that there may not be a full model 
among the candidate models. Therefore, the fence inequality 
(3) is replaced by the following: 


$$\hat Q_M - \hat Q_{\tilde M} \le c_n, \qquad (6)$$


where $\tilde M$ is the candidate model that has the minimum 
$\hat Q_M$. We use the following criterion of optimality within the 
fence which combines model simplicity and smoothness. 
For the models within the fence, choose the one with the 
smallest $q$; if there is more than one such model, choose 
the model with the smallest $p$. This gives the best choice of 
$p$ and $q$. Once $p$, $q$ are chosen, we choose the model 
within the fence with the largest $\lambda$. Once again, note that $\lambda$ 
is part of the model $M$ that is selected (or “estimated”) by 
the fence method. The tuning constant $c_n$ is chosen 
adaptively using the simplified adaptive procedure of Jiang 
et al. (2009), where parametric bootstrap is used for 
computing $p^*$ (see section 2). 
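
The criterion of optimality within the fence can be written in a few lines. The following is a sketch only: the dictionary layout and the function name are hypothetical, and the lack-of-fit values are made up for illustration.

```python
def choose_within_fence(candidates, c):
    """Apply fence inequality (6), then pick smallest q, then smallest p, then largest lam.

    Each candidate is a dict with keys 'p', 'q', 'lam' and 'q_hat' (its lack-of-fit).
    """
    q_min = min(m["q_hat"] for m in candidates)            # lack-of-fit of the best-fitting model
    fence = [m for m in candidates if m["q_hat"] - q_min <= c]
    best_q = min(m["q"] for m in fence)
    best_p = min(m["p"] for m in fence if m["q"] == best_q)
    same_pq = [m for m in fence if m["q"] == best_q and m["p"] == best_p]
    return max(same_pq, key=lambda m: m["lam"])

# Made-up candidates for illustration
candidates = [
    {"p": 1, "q": 0, "lam": 0.0, "q_hat": 9.0},
    {"p": 1, "q": 2, "lam": 0.1, "q_hat": 2.0},
    {"p": 1, "q": 2, "lam": 1.0, "q_hat": 2.4},
    {"p": 2, "q": 2, "lam": 0.1, "q_hat": 1.9},
    {"p": 3, "q": 5, "lam": 0.1, "q_hat": 1.5},
]
chosen = choose_within_fence(candidates, c=1.0)
print(chosen)
```

With $c = 1$ the linear model without knots falls outside the fence, so the smallest admissible number of knots is 2; among the linear splines with 2 knots, the one with the larger $\lambda$ is retained.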

The following theorem is proved in the Appendix. For 
simplicity, assume that the matrix $W = (X \; Z)$ is of full 
rank. Let $P_{W^\perp} = I_n - P_W$, where $n = \sum_{i=1}^{m} n_i$ and $P_W = 
W(W'W)^{-1}W'$. 


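Theorem. Computationally, the above fence procedure is 
equivalent to the following: (i) first use the adaptive fence 
to select $p$ and $q$ using (6) with $\lambda = 0$ and $\hat Q_M = y'P_{W^\perp}y$ 
(see Lemma below), and the same criterion as above for 
choosing $p$, $q$ within the fence; (ii) let $M_0$ denote the 
model corresponding to the selected $p$ and $q$, and find the 
maximum $\lambda$ such that 


$$\hat Q_{M_0, \lambda} - \hat Q_{\tilde M} \le c_n^*, \qquad (7)$$


where for any model $M$ with the corresponding $X$ and 
$Z$, we have 


$$\hat Q_{M, \lambda} = |y - X\hat\beta_\lambda - Z\hat\gamma_\lambda|^2,$$
$$\hat\beta_\lambda = (X'V_\lambda^{-1}X)^{-1}X'V_\lambda^{-1}y,$$
$$\hat\gamma_\lambda = \lambda^{-1}(I_q + \lambda^{-1}Z'Z)^{-1}Z'(y - X\hat\beta_\lambda),$$
$$V_\lambda^{-1}X = X - \lambda^{-1}Z(I_q + \lambda^{-1}Z'Z)^{-1}Z'X,$$
$$V_\lambda^{-1}y = y - \lambda^{-1}Z(I_q + \lambda^{-1}Z'Z)^{-1}Z'y,$$


and $c_n^*$ is chosen by the adaptive fence procedure described 
in section 2 ($V_\lambda$ is defined below but not directly needed 
here for the computation because of the last two equations). 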


Statistics Canada, Catalogue No. 12-001-X 


Note that in step (i) of the Theorem one does not need to 
deal with $\lambda$. The motivation for (7) is that this inequality is 
satisfied when $\lambda = 0$, so one would like to see how far $\lambda$ 
can go. In fact, the maximum $\lambda$ is a solution to the equation 
$\hat Q_{M_0, \lambda} - \hat Q_{\tilde M} = c_n^*$. The purpose of the last two equations is 
to avoid direct inversion of $V_\lambda = I_n + \lambda^{-1}ZZ'$, whose 
dimension is equal to $n$, the total sample size. Note that $V_\lambda$ 
does not have a block diagonal structure because of $ZZ'$, so 
if $n$ is large direct inversion of $V_\lambda$ may be computationally 
burdensome. 
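
The saving can be illustrated with the standard matrix inversion (Woodbury) identity, under which $V_\lambda^{-1}$ applied to any matrix requires only a $q \times q$ solve. The sketch below uses a hypothetical function name and random inputs, both assumptions made purely for the check.

```python
import numpy as np

def apply_V_inverse(Z, lam, M):
    """Compute V^{-1} M for V = I_n + Z Z' / lam using only a q x q linear solve."""
    q = Z.shape[1]
    small = np.eye(q) + Z.T @ Z / lam          # I_q + lam^{-1} Z'Z
    return M - Z @ np.linalg.solve(small, Z.T @ M) / lam

# Check against direct inversion on random inputs (illustrative sizes)
rng = np.random.default_rng(2)
n_obs, n_knots, lam = 30, 4, 0.7
Z = rng.normal(size=(n_obs, n_knots))
M = rng.normal(size=(n_obs, 2))
V = np.eye(n_obs) + Z @ Z.T / lam
direct = np.linalg.solve(V, M)
fast = apply_V_inverse(Z, lam, M)
print(np.allclose(direct, fast))
```

The direct route factors an $n \times n$ matrix, whereas the second route factors only a matrix of the size of the number of knots, which is the point of avoiding $V_\lambda$ when $n$ is large.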

The proof of the Theorem requires the following lemma, 
whose proof is given in Appendix. 


Lemma. For any $M$ and $y$, $\hat Q_{M, \lambda}$ is an increasing 
function of $\lambda$ with $\inf_{\lambda > 0} \hat Q_{M, \lambda} = y'P_{W^\perp}y$. 
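
Because the lack-of-fit is monotone in $\lambda$, the largest $\lambda$ satisfying a bound of the form (7) can be located by bisection. The code below is a sketch under assumptions: the function names, the finite search limit `lam_hi`, the simulated data, and the choice of the bound are all hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def q_hat(y, X, Z, lam):
    """Residual sum of squares at the penalized least squares solution for lam >= 0."""
    W = np.hstack([X, Z])
    pen = np.diag(np.r_[np.zeros(X.shape[1]), np.full(Z.shape[1], lam)])
    coef = np.linalg.solve(W.T @ W + pen, W.T @ y)
    return float(np.sum((y - W @ coef) ** 2))

def max_lambda(y, X, Z, bound, lam_hi=1e8, tol=1e-8):
    """Largest lam in [0, lam_hi] with q_hat(lam) <= bound, found by bisection."""
    if q_hat(y, X, Z, 0.0) > bound:
        return None                            # even lam = 0 violates the bound
    if q_hat(y, X, Z, lam_hi) <= bound:
        return lam_hi                          # the bound never binds on the search range
    lo, hi = 0.0, lam_hi
    while hi - lo > tol * max(1.0, lo):
        mid = 0.5 * (lo + hi)
        if q_hat(y, X, Z, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Simulated linear spline with knots at 1 and 2 (illustrative values)
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 3.0, 50))
X = np.vander(x, N=2, increasing=True)
Z = np.clip(x[:, None] - np.array([1.0, 2.0]), 0.0, None)
y = 1.0 - x + 2.0 * Z[:, 0] - 3.0 * Z[:, 1] + rng.normal(scale=0.2, size=50)

grid_values = [q_hat(y, X, Z, lam) for lam in (0.0, 0.1, 1.0, 10.0, 100.0)]
bound = 0.5 * (q_hat(y, X, Z, 0.0) + q_hat(y, X, Z, 1e8))   # stands in for Q(M tilde) + c*
lam_star = max_lambda(y, X, Z, bound)
print(grid_values, lam_star)
```

In an application the bound would be $\hat Q_{\tilde M} + c_n^*$ from step (i) of the Theorem; here it is set halfway between the two extreme values of the lack-of-fit so that the solution is interior.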


4. Simulations 


We consider an extension of the Fay-Herriot model (Fay 
and Herriot 1979) in a nonparametric setting. The model can 
be expressed as 


$$y_i = f(x_i) + v_i + e_i, \quad i = 1, \ldots, m, \qquad (8)$$


where $v_i$, $e_i$, $i = 1, \ldots, m$ are independent such that 
$v_i \sim N(0, A)$, $e_i \sim N(0, D_i)$, where $A$ is unknown but the 
sampling variance $D_i$ is assumed known. The main 
difference from the traditional Fay-Herriot model is $f(x_i)$, 
where $f(x)$ is an unknown smooth function. 

For simplicity we assume $D_i = D$, $1 \le i \le m$. Then, the 
model can be expressed as 


$$y_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, m, \qquad (9)$$


where $\epsilon_i \sim N(0, \sigma^2)$ with $\sigma^2 = A + D$, which is unknown. 
Thus, the model is the same as the nonparametric regression 
model. 

We consider three different cases that cover various 
situations and aspects. In the first case, Case 1, the true 
underlying function is a linear function, $f(x) = 1 - x$, 
$0 \le x \le 1$, hence the model reduces to the traditional Fay-Herriot 
model. The goal is to find out if fence can validate 
the traditional Fay-Herriot model in the case that it is valid. 
In the second case, Case 2, the true underlying function is a 
quadratic spline with two knots, given by 


$$f(x) = 1 - x + x^2 - 2(x - 1)_+^2 + 2(x - 2)_+^2, \quad 0 \le x \le 3 \qquad (10)$$


(the shape is half circle between 0 and 1 facing up, half 
circle between 1 and 2 facing down, and half circle between 
2 and 3 facing up). Note that this function is smooth in that 
it has a continuous derivative. Here we intend to investigate 
whether the fence can identify the true underlying function 
in the “perfect” situation, i.e., when $f(x)$ itself is a spline. 
The last case, Case 3, is perhaps the most practical situation, 
in which no spline can provide a perfect approximation to 
$f(x)$. In other words, the true underlying function is not 
among the candidates. In this case $f(x)$ is chosen as 
$0.5\sin(2\pi x)$, $0 \le x \le 1$, which is one of the functions 
considered by Kauermann (2005). 

We consider situations of small or medium sample size, 
namely, $m = 10$, 15 or 20 for Case 1, $m = 30$, 40 or 50 for 
Case 2, and $m = 10$, 30 or 50 for Case 3. The covariates $x_i$ 
are generated from the Uniform[0, 1] distribution in Case 1, 
and from Uniform[0, 3] in Case 2; then fixed throughout 
the simulations. Following Kauermann (2005), we let $x_i$ be 
the equidistant points in Case 3. The error standard deviation 
$\sigma$ in (9) is chosen as 0.2 in Case 1 and Case 2. This value is 
chosen such that the signal standard deviation in each case is 
about the same as the error standard deviation. As for Case 3, 
we consider three different values for $\sigma$: 0.2, 0.5 and 1.0. 
These values are also of the same order as the signal standard 
deviation in this case. 
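
The simulation design can be sketched as follows. The function names are hypothetical, and the equidistant grid for Case 3 is assumed to include both endpoints of [0, 1], a detail the text does not specify.

```python
import numpy as np

def pos(t):
    """Positive part t_+ = t * 1(t > 0)."""
    return np.clip(t, 0.0, None)

def f_case1(x):
    return 1.0 - x

def f_case2(x):
    # Quadratic spline (10) with knots at 1 and 2
    return 1.0 - x + x**2 - 2.0 * pos(x - 1.0)**2 + 2.0 * pos(x - 2.0)**2

def f_case3(x):
    return 0.5 * np.sin(2.0 * np.pi * x)

MEAN_FUNCTIONS = {1: f_case1, 2: f_case2, 3: f_case3}

def make_covariates(case, m, rng):
    """Covariates are drawn (or laid out) once and then held fixed across simulations."""
    if case == 1:
        return rng.uniform(0.0, 1.0, m)
    if case == 2:
        return rng.uniform(0.0, 3.0, m)
    return np.linspace(0.0, 1.0, m)            # assumption: grid includes both endpoints

def draw_response(case, x, sigma, rng):
    """One simulated data set from model (9)."""
    return MEAN_FUNCTIONS[case](x) + sigma * rng.standard_normal(len(x))

rng = np.random.default_rng(0)
x2 = make_covariates(2, 30, rng)
y2 = draw_response(2, x2, 0.2, rng)
x3 = make_covariates(3, 10, rng)
knot_values = f_case2(np.array([0.0, 1.0, 2.0, 3.0]))
print(knot_values, y2.shape, x3[:3])
```

Evaluating (10) at 0, 1, 2 and 3 gives the value 1 each time, consistent with the description of three arcs alternating in direction over [0, 3].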

The candidate approximating splines for Case 1 and Case 
2 are the following: $p = 0, 1, 2, 3$, $q = 0$ and $p = 1, 2, 3$, 
$q = 2, 5$ (so there are a total of 10 candidates). As for Case 
3, following Kauermann (2005), we consider only linear 
splines (i.e., $p = 1$); furthermore, we consider the number 
of knots in the range of the “rule of thumb” (i.e., roughly 4 
or 5 observations per knot; see section 1), plus the intercept 
model ($p = q = 0$) and the linear model ($p = 1$, $q = 0$). 
Thus, for $m = 10$, $q = 0, 2, 3$; for $m = 30$, $q = 0, 6, 7, 8$; 
and for $m = 50$, $q = 0, 10, 11, 12, 13$. 

Table 1 shows the results based on 100 simulations under 
Case 1 and Case 2. As in Jiang et al. (2009), we consider both 
the highest peak, that is, choosing $c_n$ with the highest $p^*$, 
and 95% lower bound (L.B.), that is, choosing a smaller $c_n$ 
corresponding to a peak of $p^*$ in order to be conservative, if 
the corresponding $p^*$ is greater than the 95% lower bound of 
the $p^*$ for any larger $c_n$ that corresponds to a peak of $p^*$. It 
is seen that the performance of the adaptive fence is satisfactory 
even with the small sample size. Also, it appears that the 
confidence lower bound method works better in smaller 
samples, but makes almost no difference in larger samples. 
These are consistent with the findings of Jiang et al. (2009). 


Table 1 
Nonparametric model selection - Case 1 and Case 2. Reported are empirical probabilities, in terms of percentage, based on 
100 simulations that the optimal model is selected 

                   Case 1                   Case 2 
Sample size        m=10    m=15    m=20     m=30    m=40    m=50 
Highest Peak       62      91      97       71      83      97 
Confidence L.B.    73      90      97       73      80      96 


Table 2 shows the results for Case 3. Note that, unlike 
Case 1 and Case 2, here there is no optimal model (an 
optimal model must be a true model according to our 
definition). So, instead of giving the empirical probabilities 
of selecting the optimal model, we give the empirical 
distribution of the selected models in each case. It is 
apparent that, as $\sigma$ increases, the distribution of the models 
selected becomes more spread out. A reverse pattern is 
observed as $m$ increases. The confidence lower bound 
method appears to perform better in picking up a model 
with splines. Within the models with splines, fence seems to 
overwhelmingly prefer fewer knots than more knots. 


Table 2 
Nonparametric model selection - Case 3. Reported are empirical distributions, in terms of percentage, of the selected models 

Sample size                       m=10       m=30               m=50 
# of knots                                   0, 6, 7, 8         0, 10, 11, 12, 13 
                                             (p, q)     %       (p, q)      % 
σ = 0.2    Highest Peak                      (1, 0)      9      (1, 10)   100 
                                             (1, 6)     91 
           Confidence L.B.                   (1, 0)      9      (1, 10)   100 
                                             (1, 6)     91 
σ = 0.5    Highest Peak                      (1, 0)     21      (1, 0)     13 
                                             (1, 6)     77      (1, 10)    84 
                                             (1, 7)      2      (1, 11)     2 
                                                                (1, 12)     1 
           Confidence L.B.                   (1, 0)      8      (1, 0)      2 
                                             (1, 6)     89      (1, 10)    94 
                                             (1, 7)      3      (1, 11) 
                                                                (1, 12) 
σ = 1.0    Highest Peak                      (0, 0)     15      (0, 0)     10 
                                             (1, 0)     18      (1, 0)     26 
                                             (1, 6)     63      (1, 10)    60 
                                             (1, 7)      4      (1, 11)     2 
                                                                (1, 12)     2 
           Confidence L.B.                   (0, 0)      1      (0, 0)      2 
                                             (1, 0)     13      (1, 0)     13 
                                             (1, 6)     82      (1, 10)    80 
                                             (1, 7)      4      (1, 11)     2 
                                                                (1, 12)     3 




Note that the fence procedure allows us to choose not 
only $p$ and $q$ but also $\lambda$ (see section 3). In each 
simulation we compute $\hat\beta = \hat\beta_\lambda$ and $\hat\gamma = \hat\gamma_\lambda$, given below (7), 
based on the $\lambda$ chosen by the adaptive fence. The fitted 
values are calculated by (1) with $\beta$ and $\gamma$ replaced by $\hat\beta$ 
and $\hat\gamma$, respectively. We then average the fitted values over 
the 100 simulations. Figure 1 shows the average fitted 
values for the three cases ($m = 10, 30, 50$) with $\sigma = 0.2$ 
under Case 3. The true underlying function values, $f(x_i) = 
0.5\sin(2\pi x_i)$, $i = 1, \ldots, m$, are also plotted for comparison. 


5. A real-life data example 


We consider a dataset from Morris and Christiansen 
(1995) involving 23 hospitals (out of a total of 219 
hospitals) that had at least 50 kidney transplants during a 
27-month period (Table 3). The $y_i$'s are graft failure rates for 
kidney transplant operations, that is, $y_i$ = number of graft 
failures / $n_i$, where $n_i$ is the number of kidney transplants at 
hospital $i$ during the period of interest. The variance for 
graft failure rate, $D_i$, is approximated by $(0.2)(0.8)/n_i$, 
where 0.2 is the observed failure rate for all hospitals. Thus, 
$D_i$ is assumed known. In addition, a severity index $x_i$ is 
available for each hospital, which is the average fraction of 
females, blacks, children and extremely ill kidney recipients 
at hospital $i$. The severity index is considered as a covariate. 
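
As a small worked example of the variance approximation, the sketch below evaluates $D_i$ for a hospital with exactly 50 transplants, the minimum for inclusion in the dataset. The function name is hypothetical.

```python
import math

def sampling_variance(n_i, rate=0.2):
    """Approximate variance of a graft failure rate based on n_i transplants."""
    return rate * (1.0 - rate) / n_i

d_min_size = sampling_variance(50)             # smallest hospitals in the dataset
sd_min_size = math.sqrt(d_min_size)
print(d_min_size, sd_min_size)
```

Since $D_i$ decreases in $n_i$, a hospital with at least 50 transplants has $D_i \le 0.0032$, that is, a sampling standard deviation of at most about 0.057.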


Table 3 
Hospital data from Morris and Christiansen (1995) 

Area    y_i      x_i      √D_i 
1       0.302    0.112    0.055 
2       0.140    0.206    0.053 
3       0.203    0.104    0.052 
4       0.333    0.168    0.052 
5       0.347    0.337    0.047 
6       0.216    0.169    0.046 
7       0.156    0.211    0.046 
8       0.143    0.195    0.046 
9       0.220    0.221    0.044 
10      0.205    0.077    0.044 
11      0.209    0.195    0.042 
12      0.266    0.185    0.041 
13      0.240    0.202    0.041 
14      0.262    0.108    0.036 
15      0.144    0.204    0.036 
16      0.116    0.072    0.035 
17      0.201    0.142    0.033 
18      0.212    0.136    0.032 
19      0.189    0.172    0.031 
20      0.212    0.202    0.029 
21      0.166    0.087    0.029 
22      0.173    0.177    0.027 
23      0.165    0.072    0.025 


Ganesh (2009) proposed a Fay-Herriot model for the 
graft failure rates as follows: $y_i = \beta_0 + \beta_1 x_i + v_i + e_i$, where 
the $v_i$'s are hospital-specific random effects and the $e_i$'s are 
sampling errors. It is assumed that $v_i$, $e_i$ are independent 
with $v_i \sim N(0, A)$ and $e_i \sim N(0, D_i)$. Here the variance 
$A$ is unknown. Based on the model Ganesh obtained credible 
intervals for selected contrasts. However, inspections of 
the raw data suggest some nonlinear trends, which raises the 
question of whether the fixed effects part of the model can 
be made more flexible in its functional form. 

To answer this question, we consider the Fay-Herriot 
model as a special member of a class of approximating spline 
models discussed in section 3. More specifically, we assume 


$$y_i = f(x_i) + v_i + e_i, \quad i = 1, \ldots, m, \qquad (11)$$


where $f(x)$ is an unknown smooth function and everything 
else is the same as in the Fay-Herriot model. We then 
consider the following class of approximating spline models: 


$$\tilde f(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \gamma_1 (x - \kappa_1)_+^p + \cdots + \gamma_q (x - \kappa_q)_+^p, \qquad (12)$$


with $p = 0, 1, 2, 3$ and $q = 0, 1, \ldots, 6$ ($p = 0$ is only for 
$q = 0$). Here the upper bound 6 is chosen according to the 
“rule-of-thumb” (because $m = 23$, so $m/4 = 5.75$). Note 
that the Fay-Herriot model corresponds to the case $p = 1$ 
and $q = 0$. The question is then to find the optimal model, in 
terms of $p$ and $q$, from this class. 

We apply the adaptive fence method described in section 
3 to this case. Here to obtain the bootstrap samples needed 
for obtaining $c_n^*$, we first compute the ML estimator under 
the model $\tilde M$, which minimizes $\hat Q_M = y'P_{W^\perp}y$ among the 
candidate models [i.e., (12); see Theorem in section 3], then 
draw parametric bootstrap samples under model $\tilde M$ with 
the ML estimators treated as the true parameters. This is 
reasonable because $\tilde M$ is the best approximating model in 
terms of the fit, even though under model (11) there may not 
be a true model among the candidate models. The bootstrap 
sample size is chosen as 100. 
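
The bootstrap step can be sketched as below. This is an illustration under assumptions rather than the code used for the analysis: the function name is hypothetical, the spline coefficients of the best-fitting model are treated as fixed effects collected in `coef_hat`, the ML estimate of $A$ is taken as given, and the inputs are made-up values rather than the hospital data.

```python
import numpy as np

def parametric_bootstrap(W, coef_hat, A_hat, D, B, rng):
    """Draw B samples y* ~ N(W coef_hat, diag(A_hat + D_i)), one sample per row."""
    mean = np.asarray(W, dtype=float) @ np.asarray(coef_hat, dtype=float)
    sd = np.sqrt(A_hat + np.asarray(D, dtype=float))
    return mean + sd * rng.standard_normal((B, mean.size))

# Made-up inputs for illustration (not the hospital data)
rng = np.random.default_rng(3)
m = 23
x = np.linspace(0.05, 0.35, m)
W = np.vander(x, N=4, increasing=True)         # cubic polynomial, no knots
coef_hat = np.array([0.2, 0.1, -0.5, 1.0])     # stands in for the fitted coefficients
A_hat = 0.001                                  # stands in for the ML estimate of A
D = np.full(m, 0.002)                          # known sampling variances

y_star = parametric_bootstrap(W, coef_hat, A_hat, D, B=4000, rng=rng)
print(y_star.shape)
```

Each row of `y_star` would then be passed through the fence procedure for every value of $c_n$ on a grid, giving the empirical selection frequencies from which $p^*$ is computed.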

The fence method selects the model $p = 3$ and $q = 0$, 
that is, a cubic function with no knots, as the optimal model. 
To make sure that the bootstrap sample size $B = 100$ is 
adequate, we repeated the analysis 100 times, each time 
using different bootstrap samples (recall in the adaptive 
fence one needs to draw bootstrap samples in order to 
determine $c_n^*$, so the question is whether different bootstrap 
samples lead to different results of model selection). All 
results led to the same model: a cubic function with no knots 
(even though the bootstrap-derived intermediate quantities, 
such as $p^*$ and $c_n^*$, varied across bootstraps). We also ran 
the data analysis using $B = 1{,}000$, and the selected model 
remained the same. Thus, it appears that the bootstrap 
sample size $B = 100$ is adequate. The left panel of Figure 2 
shows the plot of $p^*$ against $c_n$ in the adaptive fence 
model selection. 


Survey Methodology, June 2010 


Figure 1 Case 3 Simulation. Top figure: Average fitted values for m = 10. Middle figure: Average fitted 
values for m = 30. Bottom figure: Average fitted values for m = 50. In all cases, the dots 
represent the fitted values, while the circles correspond to the true underlying function 


A few comparisons are always helpful. Our first 
comparison is to fence itself but with a more restricted space 
of candidate models. More specifically, we consider (12) 
with the restriction to linear splines only, i.e., $p = 1$, and 
knots in the range of the “rule of thumb”, i.e., $q = 4, 5, 6$, 
plus the intercept model ($p = q = 0$) and the linear model 
($p = 1$, $q = 0$). In this case, the fence method selected a 
linear spline with four knots (i.e., $p = 1$, $q = 4$) as the 
optimal model. The value of $\lambda$ corresponding to this model 
is approximately equal to 0.001. The plot of $p^*$ against $c_n$ 
for this model selection is very similar to the left panel of 
Figure 2, and therefore omitted. In addition, the right panel 
of Figure 2 shows the fitted values and curves under the two 
models selected by the fence from within the different 
model spaces as well as the original data points. 

A further comparison can be made by treating (11) as a 
generalized additive model (GAM) with heteroscedastic 
errors. A weighted fit can be obtained with the amount of 
smoothing optimized by using a generalized cross-validation 
(GCV) criterion. Here the weights used are 
$w_i = 1/(A + D_i)$, where the maximum likelihood estimate 
for $A$ is used as a plug-in estimate. Recall that the $D_i$'s are 
known. This fitted function is also overlaid in the right 
panel of Figure 2. Notice how closely this fitted function 
resembles the restricted space fence fit. 

To expand the class of models under consideration by 
GCV-based smoothing, we used the BRUTO procedure 
(Hastie and Tibshirani 1990) which augments the class of 
models to look at a null fit and a linear fit for the spline 
function; and embeds the resulting model selection (i.e., 
null, linear or smooth fits) into a weighted backfitting 
algorithm using GCV for computational efficiency. 
Interestingly here, BRUTO finds simply an overall linear fit 
for the fixed effects functional form. While certainly an 
interesting comparison, BRUTO’s theoretical properties for 
models like (11) have not really been studied in depth. 

Finally, as mentioned in section 3, by using the 
connection between P-spline and linear mixed model one 
can formulate (12) as a linear mixed model, where the spline 
coefficients are treated as random effects. The problem then 
becomes a (parametric) mixed model selection problem, 
hence the method of Jiang et al. (2009) can be applied. In 
fact, this was our initial approach to this dataset, and the 
model we found was the same as the one by BRUTO. 
However, we have some reservation about this approach, as 
explained in section 3.

Figure 2 Left: A plot of $p^*$ against $c_n$ from the search over the full model space. Right: The raw data and the
fitted values and curves; dots and their curve correspond to the cubic function resulting from the full model
search; squares and their lines correspond to the linear spline with 4 knots resulting from the restricted
model search; green *s and their lines represent the GAM fits.


6. Concluding remarks 


Although the focus of the current paper is nonparametric 
SAE model selection, our method may be applicable to 
spline-based mixed effects model selection problems in 
other areas, for example, in the analysis of longitudinal data 
(e.g., Wang 2005). 

In the case where a true model exists among the
candidate models, such as Cases 1 and 2 in section 4,
consistency of the proposed fence model selection method
can be established in the same way as in Section 3 of Jiang
et al. (2009) (although the result of the latter paper does not
directly apply). In practice, however, nonparametric
modeling is most useful when a true model does not exist,
or is not among the candidates, such as Case 3 in section 4.
In this case, no consistency result can be proved, of course.
It remains unclear what asymptotic behavior would be
desirable to study in the latter case.

Appendix


1. Proof of Lemma. Write $g(\lambda) = \hat{Q}_{M,\lambda}$. It can be shown
(detail omitted) that $g'(\lambda) = 2\lambda y' B_\lambda A_\lambda B_\lambda' y$, where
$A_\lambda = B'(W'W + \lambda BB')^{-1}B$ and $B_\lambda = W(W'W + \lambda BB')^{-1}B$,
with $B' = (0 \; I_q)$ and $W = (X \; Z)$. Hence $g'(\lambda) \ge 0$ for $\lambda \ge 0$.
Also $\hat{Q}_{M,\lambda} \to \hat{Q}_M$ as $\lambda \to 0$.


2. Proof of Theorem. Consider the fence inequality

$$\hat{Q}_{M,\lambda} - \hat{Q}_{\tilde{M},\tilde{\lambda}} \le c_n, \qquad (A.1)$$

where $(\tilde{M}, \tilde{\lambda})$ minimizes $\hat{Q}_{M,\lambda}$. Also consider the fence
inequality using $\hat{Q}_M = y' P_{W^{\perp}} y$, which is

$$\hat{Q}_M - \hat{Q}_{\bar{M}} \le c_n. \qquad (A.2)$$

By Lemma, we must have $\tilde{\lambda} = 0$ and $\tilde{M} = \bar{M}$. Hence
$\hat{Q}_{\tilde{M},\tilde{\lambda}} = \hat{Q}_{\bar{M}}$. It follows, again by Lemma, that for the same
$c_n$, (A.2) holds if and only if (A.1) holds for some $\lambda$.
Therefore, the models within the fence, in terms of $p$ and
$q$, are the same under both procedures. It is then easy to
see, according to the selection criterion, that the same model
$M_0 = M_0(c_n)$, in terms of $p$ and $q$, will be selected under
both procedures for the given $c_n$. It then follows that the $c^*$
selected using the adaptive procedure will be the same under
both procedures. Then, once again using the above
argument, the optimal model $M_0^*$, in terms of $p$ and $q$,
will be the same under both procedures.

The formulae below (7) can be derived using the
expressions of BLUE and BLUP (e.g., Jiang 2007, §2.3.1)
and the following identity (e.g., Sen and Srivastava 1990,
page 275): If $U$ is $n \times q$ and $V$ is $q \times n$, then
$(P + UV)^{-1} = P^{-1} - P^{-1}U(I_q + VP^{-1}U)^{-1}VP^{-1}$, so long as
the inverses exist.


Gross flow estimation in dual frame surveys


Yan Lu and Sharon Lohr


Abstract 


Gross flows are often used to study transitions in employment status or other categorical variables among individuals in a 
population. Dual frame longitudinal surveys, in which independent samples are selected from two frames to decrease survey 
costs or improve coverage, can present challenges for efficient and consistent estimation of gross flows because of complex 
designs and missing data in either or both samples. We propose estimators of gross flows in dual frame surveys and examine 
their asymptotic properties. We then estimate transitions in employment status using data from the Current Population
Survey and the Survey of Income and Program Participation.


Key Words: Complex surveys; Dual frame surveys; Jackknife; Longitudinal estimation; Missing data. 


1. Introduction 


Many current surveys follow the same individuals at 
regular time intervals so that longitudinal quantities such as 
transitions in employment status and poverty status can be 
studied. The U.S. Current Population Survey (CPS; United 
States Census Bureau 2006), for example, uses a rotating 
panel design in which persons in a housing unit selected for 
the survey are interviewed for four consecutive months, 
rested for eight months, and then interviewed again for four 
consecutive months. This design allows estimation of 
quantities related to individuals’ changes over time. Since 
many survey responses are categorical, gross flows, which 
are transitions among states of a categorical variable over 
time, are particularly important. 

Table 1 displays the counts of a categorical variable
measured at two times in a population of $N$ units. At time
1, the variable can be in one of $r$ states and at time 2, the
variable can be in one of $c$ states. To illustrate Table 1, we
give the following example. In studying changes in
employment status, we might have $r = 2$ and $c = 2$, with
state 0 representing unemployment and state 1 representing
employment. Then $X_{00}$ gives the count of persons in the
population who are unemployed at both times, $X_{10}$ is the
number of persons who are employed at time 1 but
unemployed at time 2, $X_{0+}$ is the total number of persons who
are unemployed at time 1, and so on. It is of interest to
obtain estimates and standard errors of the gross flows $X_{kl}$,
$k = 0, \ldots, r-1$, $l = 0, \ldots, c-1$, using survey data. This
can be complicated in practice because of missing data and
other problems.
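
As an illustration of the layout in Table 1, the following minimal sketch cross-tabulates paired states into a gross flow table and its margins. The function names and the eight-person population are hypothetical and serve only to show how $X_{kl}$, $X_{k+}$ and $X_{+l}$ relate; no survey weighting or missing data is involved here.

```python
from collections import Counter


def gross_flow_table(time1, time2, r, c):
    """Cross-tabulate paired states into an r x c table of counts X[k][l]."""
    if len(time1) != len(time2):
        raise ValueError("time1 and time2 must have the same length")
    counts = Counter(zip(time1, time2))
    return [[counts[(k, l)] for l in range(c)] for k in range(r)]


def margins(table):
    """Return the row totals X_{k+} and the column totals X_{+l}."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals


# Hypothetical population of N = 8 persons; 0 = unemployed, 1 = employed.
time1 = [0, 0, 0, 1, 1, 1, 1, 1]
time2 = [0, 0, 1, 0, 1, 1, 1, 1]

X = gross_flow_table(time1, time2, r=2, c=2)
row_totals, col_totals = margins(X)
print(X, row_totals, col_totals)
```

In this toy population, `X[1][0]` counts the persons employed at time 1 but unemployed at time 2, and the row totals sum to $N$.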

While successive cross-sectional estimates can assess a 
change in unemployment rates over time, only a longi- 
tudinal survey addresses issues such as persistence of 
unemployment in individuals. Gross flow estimation using 
survey data has been studied by many authors, including
Chambers, Woyzbun and Pillig (1988), Hocking and
Oxspring (1971), Blumenthal (1968), Chen and Fienberg 
(1974), Stasny (1984, 1987), and Stasny and Fienberg 
(1986). Most of this work considered methods for obtaining 
maximum likelihood (ML) estimators for expected cell 
values in contingency tables with partially cross-classified 
data. Pfeffermann, Skinner and Humphreys (1998) proposed 
estimators that account for misclassification in survey data. 
All of this work has assumed that a probability sample, 
usually a simple random sample, has been taken from a 
single sampling frame. 


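Table 1
Gross flow table for population

                                  Time 2
                  0           1           2           ...   c-1             Total
          0       X_{00}      X_{01}      X_{02}      ...   X_{0,c-1}       X_{0+}
          1       X_{10}      X_{11}      X_{12}      ...   X_{1,c-1}       X_{1+}
Time 1    2       X_{20}      X_{21}      X_{22}      ...   X_{2,c-1}       X_{2+}
          ...
          r-1     X_{r-1,0}   X_{r-1,1}   X_{r-1,2}   ...   X_{r-1,c-1}     X_{r-1,+}
          Total   X_{+0}      X_{+1}      X_{+2}      ...   X_{+,c-1}       N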


A number of longitudinal surveys, such as the Canadian 
National Longitudinal Survey of Children and Youth and 
the Canadian Household Panel Survey, have now started or 
are considering implementation of a dual frame or multiple 
frame design. In a multiple frame survey, probability 
samples are selected independently from two or more 
frames. Using more than one frame often gives better 
coverage of the population, and can achieve considerable 
cost savings in some populations. For example, the Assets 
and Health Dynamics Survey (Heeringa 1995), with the 
goal of estimating characteristics of the population aged 
over 65, used a dual frame survey in which frame A was
the frame for a national general population survey and frame
B was a list of Medicare enrollees. The structure of this
survey is illustrated in Figure 1. Frame A covered the entire
population but required extensive screening to identify 
individuals in the target population and was thus expensive 
to sample from; frame B was less expensive to sample, but 
did not include the entire population. Kalton and Anderson 
(1986) described uses of dual frame surveys to sample rare 
populations; Blair and Blair (2006) argued that dual frame 
surveys can take advantage of less expensive sampling 
modes such as internet sampling when sampling rare 
populations.


Figure 1 Frame B is a subset of frame A


Figure 2 Frames A and B are both incomplete but overlapping 


In other situations, both frames may be incomplete, as 
depicted in Figure 2. Hartley (1962, 1974) first proposed 
estimators for the dual frame survey design in Figure 2, 
when independent samples are taken from each frame. 
Subsequent developments are given in Bankier (1986), 
Fuller and Burmeister (1972), Skinner and Rao (1996), and 
Lohr and Rao (2000). Lohr and Rao (2006) summarized 
methods for estimating population quantities in cross- 
sectional multiple frame surveys. 

In this paper, we propose estimators for gross flows that 
can be applied to dual frame surveys in which longitudinal 
information is collected in one or both samples. Units 
sampled in one or both surveys are followed over time; in 
some cases, additional units are sampled at later times to 
incorporate new population units or compensate for
attrition. A longitudinal dual frame survey presents challenges
beyond those found in longitudinal single frame
surveys or in cross-sectional dual frame surveys. Missing
data can occur in the sample from either frame, and units 
may change frame membership between interviews in the 
survey. In addition, either sampling design may be complex, 
with stratification and clustering. In an overlapping dual 
frame survey such as that depicted in Figure 2, one wishes 
to use the information in the overlap as efficiently as 
possible. The problem studied in this article is to use all the 
information sampled from frame A and frame B to 
estimate the transition probabilities of the population. 

The article is organized as follows. In Section 2, we set 
up the research problem. In Section 3, we derive gross flow 
estimators in dual frame surveys for complex samples with 
possibly missing data. In Section 4, we derive asymptotic 
properties and discuss variance estimation. An application 
of our research to the Current Population Survey and Survey 
of Income and Program Participation is given in Section 5. 
Finally, we give our conclusions in Section 6. 


2. Notation and sample quantities 


Suppose there are two sampling frames, frame $A$ and
frame $B$, which together cover the population of interest
$A \cup B$ as shown in Figure 2. In Hartley’s (1962) notation,
there are three nonoverlapping domains: $a = A \cap B^c$,
$b = A^c \cap B$, and $ab = A \cap B$, where $c$ denotes the
complement of a set. The population sizes for frames $A$ and $B$
are $N_A$ and $N_B$, with domain population sizes $N_a$, $N_b$,
and $N_{ab}$. We assume that $N_A$ and $N_B$ are known, but the
population size $N = N_A + N_B - N_{ab}$ may be unknown. In
this article, we assume that both the population and the
frames are fixed over time. These are strong assumptions,
but in many longitudinal surveys the population of interest
and the frames may be defined for time 1.

Assume for this section that domain membership is
constant over time. For simplicity of notation in this paper
we assume that $r = 2$ and $c = 2$ so that there are two
possible categories at each time; the general case is similar.
Since the three domains are nonoverlapping, each
population count $X_{kl}$, $k = 0, 1$, $l = 0, 1$, can be written as
$X_{kl} = X_{kla} + X_{klab} + X_{klb}$, where $X_{kld}$ is the number of
population units in domain $d$ that are in state $k$ at time 1 and
state $l$ at time 2. The corresponding population and domain
probabilities are $p_{kl} = X_{kl}/N$ and $p_{kld} = X_{kld}/N_d$, for
$d \in \{a, ab, b\}$.

Independent probability samples, $S_A$ and $S_B$, with
sample sizes $n_A$ and $n_B$, are taken from frames $A$ and $B$.
Let $w_i^A$ be the weight of sampled unit $i$ for the sample
from frame $A$ and let $w_j^B$ be the weight of sampled unit $j$
for the sample from frame $B$. We may take $w_i^A$ to be the
sampling weight $[P(i \in S_A)]^{-1}$ or a Hájek-type weight
$[P(i \in S_A)]^{-1} N_A /$ (sum of sampling weights in $S_A$). Other
weighting schemes for longitudinal data, discussed in
Verma, Betti and Ghellini (2007) and Lavallée (2007),
might also be used. Let $y_i = (y_{i1}, y_{i2})$ be the response for
unit $i$ in $S_A$, with $y_{i1}, y_{i2} \in \{0, 1, M\}$, where $M$ denotes
that the value is missing. Then
$\hat{X}^A_{kla} = \sum_{i \in S_A} w_i^A I(y_{i1} = k) I(y_{i2} = l) I(i \in a)$ and
$\hat{X}^A_{klab} = \sum_{i \in S_A} w_i^A I(y_{i1} = k) I(y_{i2} = l) I(i \in ab)$
estimate the population counts for the
$(k, l)$ cell in domains $a$ and $ab$ from $S_A$, for
$k, l \in \{0, 1, M\}$. Let $y_j = (y_{j1}, y_{j2})$ be the response for
unit $j$ in $S_B$, and let
$\hat{X}^B_{klb} = \sum_{j \in S_B} w_j^B I(y_{j1} = k) I(y_{j2} = l) I(j \in b)$ and
$\hat{X}^B_{klab} = \sum_{j \in S_B} w_j^B I(y_{j1} = k) I(y_{j2} = l) I(j \in ab)$
be the corresponding estimators from $S_B$.
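
The weighted cell estimators defined above amount to summing the weights of the sampled units that fall in each (time-1 state, time-2 state, domain) combination. The following is a minimal sketch of that bookkeeping for one frame; the record layout `(weight, y1, y2, domain)`, the function name and the numbers are assumptions made for illustration only.

```python
def estimate_cell_counts(sample):
    """Sum the weights by (time-1 state, time-2 state, domain).

    `sample` is an iterable of (weight, y1, y2, domain) records, where each
    state is 0, 1 or "M" (missing) and domain is "a", "ab" or "b".
    """
    totals = {}
    for weight, y1, y2, domain in sample:
        key = (y1, y2, domain)
        totals[key] = totals.get(key, 0.0) + weight
    return totals


# Hypothetical frame-A sample: five units with their weights.
sample_A = [
    (10.0, 0, 0, "a"),
    (10.0, 0, 1, "a"),
    (12.5, 1, 1, "ab"),
    (12.5, 1, "M", "ab"),   # observed at time 1 only
    (10.0, 0, 0, "a"),
]

X_hat_A = estimate_cell_counts(sample_A)
print(X_hat_A[(0, 0, "a")])      # estimated count for cell (0, 0) in domain a
print(X_hat_A[(1, "M", "ab")])   # partially classified units in domain ab
```

The same function applies unchanged to the frame $B$ sample, with domains "b" and "ab".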

In this paper, we assume that domain membership can be
determined for every sample unit and that the responses $y_i$
have no classification error. Thus, we assume that we know
whether each unit in the frame $A$ or frame $B$ sample
belongs to the other frame or not. We also assume that there
is no measurement error for $y_{i1}$ and $y_{i2}$; in the
employment example, this means that every respondent gives the
correct response for his or her employment status. Thus, the
methods proposed in this article are sensitive to
misclassification of observations into domains and into cells. If
the domain means differ or if observations are classified
incorrectly, the estimators of gross flows could be biased;
Pfeffermann et al. (1998) discussed methods of accounting
for misclassification in single frame surveys.

The estimators from $S_A$ are displayed in Table 2. A
similar table may be constructed for the estimators from $S_B$.
We assume that each unit is sampled during one or both
time periods. If there is no missing data, then all the
estimated counts for cells $(k, M)$ and $(M, l)$ are zero.
Using the exact or approximate unbiasedness of the
estimators, depending on whether the sampling or Hájek
weights are used, when there is no missing data,
$E[\hat{X}^A_{kla}] \approx X_{kla}$, $E[\hat{X}^A_{klab}] \approx E[\hat{X}^B_{klab}] \approx X_{klab}$, and
$E[\hat{X}^B_{klb}] \approx X_{klb}$.

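Table 2
Estimators from the frame A sample

Domain   Time 1    Time 2 = 0          Time 2 = 1          Time 2 missing      Total
a        0         \hat{X}^A_{00a}     \hat{X}^A_{01a}     \hat{X}^A_{0Ma}     \hat{X}^A_{0+a}
a        1         \hat{X}^A_{10a}     \hat{X}^A_{11a}     \hat{X}^A_{1Ma}     \hat{X}^A_{1+a}
a        Missing   \hat{X}^A_{M0a}     \hat{X}^A_{M1a}                         \hat{X}^A_{M+a}
ab       0         \hat{X}^A_{00ab}    \hat{X}^A_{01ab}    \hat{X}^A_{0Mab}    \hat{X}^A_{0+ab}
ab       1         \hat{X}^A_{10ab}    \hat{X}^A_{11ab}    \hat{X}^A_{1Mab}    \hat{X}^A_{1+ab}
ab       Missing   \hat{X}^A_{M0ab}    \hat{X}^A_{M1ab}                        \hat{X}^A_{M+ab}
Total              \hat{X}^A_{+0}      \hat{X}^A_{+1}      \hat{X}^A_{+M}      \hat{N}_A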


3. Gross flow estimators in dual frame surveys 


In this section, we derive gross flow estimators for com- 
plex samples in dual frame surveys. A dual frame pseudo- 
likelihood approach is used to account for the sampling 
designs and missing data mechanism. A dual frame ap- 
proach can improve precision of the estimators and provide 
more flexibility to model the missing data mechanism. 
Methods in current use for handling missing data are based 
on standard statistical methods and fall into four general 
categories (Little and Rubin 2002): complete-case analysis, 
weighting methods, imputation methods and model-based 
methods. We adopt a model-based approach for the missing 
data. In this section, we first consider a simple setup with 
simple random samples from a population with no missing 
data. Then we add a model for the missing data mechanism. 
Finally, we discuss estimators for more complex survey 
designs. 


3.1 Simple random samples with complete data 


To motivate the estimator in the general case, we first
study estimation of gross flows when there is no missing
data and when the sample from each frame is a simple
random sample. Then $x^A_{kld} = n_A \hat{X}^A_{kld}/N_A$, for $d = a, ab$, is
the observed sample count in cell $kl$ and domain $d$ from
$S_A$; $x^B_{kld} = n_B \hat{X}^B_{kld}/N_B$, for $d = b, ab$, is the corresponding
observed sample count from $S_B$.

If the sampling fractions are small, a multinomial
approximation may be used for the likelihood. For the sample
from frame $A$, there are eight cells with associated
probabilities $P^A_{kld} = p_{kld} N_d/N_A$, for $k, l \in \{0, 1\}$ and $d \in \{a, ab\}$.
The related probabilities for the sample from frame $B$ are
$P^B_{kld} = p_{kld} N_d/N_B$, for $k, l \in \{0, 1\}$ and $d \in \{b, ab\}$.
Using the multinomial distribution and the assumption that
the samples from the two frames are selected independently,
the likelihood function is

$$L(p, N_{ab}) \propto \prod_{k,l,d} (P^A_{kld})^{x^A_{kld}} \times \prod_{k,l,d} (P^B_{kld})^{x^B_{kld}}.$$

Although the likelihood is written for simplicity in terms of
$P^A_{kld}$ and $P^B_{kld}$, the underlying parameters of interest are
$p = (p_{00a}, p_{01a}, \ldots, p_{11b})'$ and $N_{ab}$.
Setting the partial derivatives of the loglikelihood with
respect to the parameters equal to zero, the maximum
likelihood estimators are $\hat{p}_{kla} = x^A_{kla}/n^A_a$, $\hat{p}_{klb} = x^B_{klb}/n^B_b$ and
$\hat{p}_{klab} = (x^A_{klab} + x^B_{klab})/(n^A_{ab} + n^B_{ab})$, where
$n^A_{ab} = \sum_{i \in S_A} I(i \in ab)$, $n^B_{ab} = \sum_{j \in S_B} I(j \in ab)$,
$n^A_a = n_A - n^A_{ab}$, and $n^B_b = n_B - n^B_{ab}$.
The MLE for $N_{ab}$, $\hat{N}_{ab}$, is the smaller root of the
quadratic equation

$$[n_A + n_B] N_{ab}^2 - [n_A N_B + n_B N_A + n^A_{ab} N_A + n^B_{ab} N_B] N_{ab} + [n^A_{ab} + n^B_{ab}] N_A N_B = 0. \qquad (1)$$

Finally, using the above results, we construct the MLEs for
$X_{kl}$ and $p_{kl}$:

$$\hat{X}_{kl} = (N_A - \hat{N}_{ab}) \hat{p}_{kla} + \hat{N}_{ab} \hat{p}_{klab} + (N_B - \hat{N}_{ab}) \hat{p}_{klb},$$

$$\hat{p}_{kl} = \frac{(N_A - \hat{N}_{ab}) \hat{p}_{kla} + \hat{N}_{ab} \hat{p}_{klab} + (N_B - \hat{N}_{ab}) \hat{p}_{klb}}{N_A + N_B - \hat{N}_{ab}}.$$


These estimators are the same as those obtained by 
Skinner (1991). However, Skinner used the approximate 
normal distribution of the response mean y in each domain 
to obtain the MLEs, while our estimators come from a 
multinomial model. The multinomial model allows us to 
include partially classified information from units observed 
at only one time period, as shown in the next section. 
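
The closed-form steps of this subsection can be sketched numerically: solve the quadratic (1) for the smaller root to obtain the overlap size, then combine the three domain proportions into the gross flow estimate for one cell. This is only an illustrative sketch with hypothetical function names and made-up sample counts; it assumes simple random samples with complete data, as in the text.

```python
import math


def mle_overlap_size(n_A, n_B, n_ab_A, n_ab_B, N_A, N_B):
    """Smaller root of the quadratic (1) in N_ab."""
    a = n_A + n_B
    b = -(n_A * N_B + n_B * N_A + n_ab_A * N_A + n_ab_B * N_B)
    c = (n_ab_A + n_ab_B) * N_A * N_B
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)


def mle_gross_flow(N_A, N_B, N_ab_hat, p_a, p_ab, p_b):
    """Combine domain proportions for one cell into (X_hat_kl, p_hat_kl)."""
    X_hat = (N_A - N_ab_hat) * p_a + N_ab_hat * p_ab + (N_B - N_ab_hat) * p_b
    return X_hat, X_hat / (N_A + N_B - N_ab_hat)


# Hypothetical frame sizes, sample sizes and overlap counts.
N_A, N_B = 1000, 500
n_A, n_B = 100, 50
n_ab_A, n_ab_B = 20, 20

N_ab_hat = mle_overlap_size(n_A, n_B, n_ab_A, n_ab_B, N_A, N_B)

# Hypothetical sample counts for cell (0, 0) in each domain.
p_00a = 8 / (n_A - n_ab_A)               # x^A_{00a} / n^A_a
p_00b = 6 / (n_B - n_ab_B)               # x^B_{00b} / n^B_b
p_00ab = (3 + 5) / (n_ab_A + n_ab_B)     # pooled over both samples

X_00_hat, p_00_hat = mle_gross_flow(N_A, N_B, N_ab_hat, p_00a, p_00ab, p_00b)
print(N_ab_hat, X_00_hat, p_00_hat)
```

In this example both samples suggest an overlap of 200 units (20 percent of frame $A$, 40 percent of frame $B$), and the smaller root reproduces that value; the larger root of the quadratic exceeds $N_B$ and is therefore not admissible.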


3.2 Simple random samples with missing data 


In practice, individuals may appear in the sample at only 
one of the times. This can occur due to sample attrition 
(when members of the sample drop out during the course of 
a study) or other causes. In a rotating panel survey such as 
the CPS, persons rotating out of the survey at time 1 will not
be contacted for time 2 and thus their time-2 employment 
status will be unknown. In other situations, one of the sam- 
ples may be cross sectional, in which case all observations 
are measured at exactly one time. 


3.2.1 Model for missing data


Blumenthal (1968), Chen and Fienberg (1974), Stasny
(1984, 1987) and Stasny and Fienberg (1986) used a
two-phase procedure to model the missing data in a single
sample. A model is proposed for the complete data, and then
the missing data mechanism is modeled. We extend this
procedure to our dual frame structures. One advantage of a
dual frame survey is that it provides more flexibility for the
missing data models.

First, we assume that if all units were measured at both
times, the model in Section 3.1 could be used. For the
nonresponse mechanism, assume that each observation in cell
$(k, l)$ and domain $d$ from $S_A$ has probability $\phi^A_{kld}$ of
being missing at time 1 and probability $\psi^A_{kld}$ of being
missing at time 2. We assume the unit cannot be missing at
both times.

This formulation assumes a constant probability that an
observation will be missing within a given cell, domain, and
frame. If data could be missing for different reasons,
additional parameters could be used to distinguish
observations that have partial classification because of, say, the
rotating panel design, and observations that have partial
classification because of nonresponse. In Section 5, we
discuss an alternative approach that might be used with
multiple mechanisms for missing data.

For $k, l \in \{0, 1\}$, the probability that a unit from $S_A$ is
observed in cell $(k, l)$ and domain $d$ is

$$Q^A_{kld} = P^A_{kld}(1 - \phi^A_{kld} - \psi^A_{kld}).$$

The probability that a unit from $S_A$ is observed in cell
$(k, M)$ and domain $d$ is

$$Q^A_{kMd} = \sum_{l=0}^{1} P^A_{kld} \psi^A_{kld}.$$

Similarly, the probability that a unit from $S_A$ is observed in
cell $(M, l)$ and domain $d$ is

$$Q^A_{Mld} = \sum_{k=0}^{1} P^A_{kld} \phi^A_{kld}.$$

The probabilities for frame $B$ are defined similarly with
$Q^B_{kld} = P^B_{kld}(1 - \phi^B_{kld} - \psi^B_{kld})$, $Q^B_{kMd} = \sum_{l=0}^{1} P^B_{kld} \psi^B_{kld}$ and
$Q^B_{Mld} = \sum_{k=0}^{1} P^B_{kld} \phi^B_{kld}$.
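
To make the two-phase model concrete, the sketch below computes the observed-cell probabilities for one frame and one domain from the complete-data probabilities and the two sets of missingness probabilities, following the three formulas above ($\phi$ for missing at time 1, $\psi$ for missing at time 2). The function name and the numerical inputs are hypothetical.

```python
def observed_cell_probabilities(P, phi, psi):
    """Observed-cell probabilities for one frame and domain.

    P, phi and psi are 2 x 2 nested lists indexed [k][l]:
    P[k][l]   complete-data probability of cell (k, l),
    phi[k][l] probability of being missing at time 1,
    psi[k][l] probability of being missing at time 2.
    """
    states = (0, 1)
    Q = {}
    for k in states:
        for l in states:
            # Fully classified units: observed at both times.
            Q[(k, l)] = P[k][l] * (1.0 - phi[k][l] - psi[k][l])
    for k in states:
        # Time-2 state missing: only the time-1 state k is observed.
        Q[(k, "M")] = sum(P[k][l] * psi[k][l] for l in states)
    for l in states:
        # Time-1 state missing: only the time-2 state l is observed.
        Q[("M", l)] = sum(P[k][l] * phi[k][l] for k in states)
    return Q


# Hypothetical inputs with constant missingness probabilities.
P = [[0.1, 0.2], [0.3, 0.4]]
phi = [[0.1, 0.1], [0.1, 0.1]]
psi = [[0.2, 0.2], [0.2, 0.2]]

Q = observed_cell_probabilities(P, phi, psi)
print(Q)
```

Because a unit cannot be missing at both times, the eight observed-cell probabilities add up to the total of the complete-data probabilities for that domain.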

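Under this two phase model, and using the assumption of
independence of the samples, the likelihood function for the
two samples is:

$$\begin{aligned}
L(p, \psi, \phi, N_{ab}) \propto{} & \prod_{k \in \{0,1\}} \prod_{l \in \{0,1\}} \prod_{d \in \{a, ab\}} (Q^A_{kld})^{x^A_{kld}} \\
& \times \prod_{k \in \{0,1\}} \prod_{l \in \{0,1\}} \prod_{d \in \{b, ab\}} (Q^B_{kld})^{x^B_{kld}} \\
& \times \prod_{k \in \{0,1\}} \prod_{d \in \{a, ab\}} (Q^A_{kMd})^{x^A_{kMd}} \\
& \times \prod_{l \in \{0,1\}} \prod_{d \in \{a, ab\}} (Q^A_{Mld})^{x^A_{Mld}} \\
& \times \prod_{k \in \{0,1\}} \prod_{d \in \{b, ab\}} (Q^B_{kMd})^{x^B_{kMd}} \\
& \times \prod_{l \in \{0,1\}} \prod_{d \in \{b, ab\}} (Q^B_{Mld})^{x^B_{Mld}}, \qquad (2)
\end{aligned}$$

where $\psi$ is the vector of $\psi^A_{kld}$’s and $\psi^B_{kld}$’s and $\phi$ is the
vector of $\phi^A_{kld}$’s and $\phi^B_{kld}$’s.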

The expression in (2) is for the most general model, in
which both surveys are longitudinal and both have missing
data at each time period. If frame $A$ uses a rotating panel
survey, for example, then all of the probabilities $Q^A$ are
nonzero: the units in the panels measured at both time
periods will be included in the estimators $x^A_{kld}$ for
$k, l \in \{0, 1\}$, the units in the panels leaving the survey after
time 1 will be included in the estimators $x^A_{kMd}$, and the units
in the incoming panels will be included in the estimators
$x^A_{Mld}$. Depending on the structure of the surveys, some of
the factors in (2) may be omitted. For example, if the survey
from frame $B$ is a repeated cross-sectional survey with
small sampling fraction, the probabilities $Q^B_{kld}$ for
$k, l \in \{0, 1\}$ will be close to zero, and we would omit those
factors from the likelihood.

The likelihood in (2) can be written as a product of a
factor with $N_{ab}$ and a factor containing the remaining
parameters. As a consequence, the MLE for $N_{ab}$ is again
the smaller root of the equation in (1). We discuss the
estimators of the remaining parameters in the next section.


3.2.2 Model identifiability and reduced models 


A problem with maximizing the likelihood in (2) is that 
under the general model there are a total of 42 parameters 
while the two samples have only 32 observed cell counts. 
Thus we cannot estimate all the parameters under the most 
general model. But we can consider models with reduced 
parameterizations, as done in Chen and Fienberg (1974) for 
single frame surveys. The dual frame situation, in fact, gives 
much more flexibility for modeling the missing data 
because of the independent information from the two 
samples about domain ab. 

We first state conditions for a reduced model to be
locally identifiable. Let $\theta$ denote the $s$-vector of
parameters of interest; in our case, $\theta$ would include linearly
independent components of $p$, $N_{ab}/N$, and parameters for
the missing data mechanism. In the likelihood in (2), the
probabilities from the independent multinomial samples are
$Q^A_{kld}$ and $Q^B_{kld}$. These probabilities may be written as
functions of $\theta$, with $Q^A(\theta) = (Q^A_1, \ldots, Q^A_g)'$ a $g$-vector
of the nonzero $Q^A_{kld}$’s and $Q^B(\theta) = (Q^B_1, \ldots, Q^B_q)'$ a
$q$-vector of the nonzero $Q^B_{kld}$’s. When all cells in Table 2 and
the analogous table for frame $B$ have nonzero probabilities,
$g = q = 16$. Let $D = (D_A', D_B')'$ be the derivative matrix
of the transformation, with $D_{A(r,t)} = \partial Q^A_r/\partial \theta_t$ and
$D_{B(u,t)} = \partial Q^B_u/\partial \theta_t$, for $r = 1, \ldots, g - 1$, $u = 1, \ldots, q - 1$,
and $t = 1, \ldots, s$. Then, using Theorems 3, 4 and 5 in
Catchpole and Morgan (1997), the model is locally
identifiable if the matrix $D$ is of full rank. The proof for the dual
frame situation is given in Lu (2007).
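
The rank condition can be checked numerically. The sketch below is a simplified single-sample, single-domain analogue, not the dual frame derivative matrix of the paper: it differentiates the eight observed-cell probabilities with respect to the parameters by central differences and computes the rank by Gaussian elimination. All function names, the step size, the tolerance and the parameter values are assumptions made for illustration.

```python
def cell_probs(p_free, phi, psi):
    """Eight observed-cell probabilities from p = (p00, p01, p10), phi, psi."""
    p00, p01, p10 = p_free
    p = [[p00, p01], [p10, 1.0 - p00 - p01 - p10]]
    complete = [p[k][l] * (1.0 - phi[k][l] - psi[k][l])
                for k in (0, 1) for l in (0, 1)]
    miss_t2 = [sum(p[k][l] * psi[k][l] for l in (0, 1)) for k in (0, 1)]
    miss_t1 = [sum(p[k][l] * phi[k][l] for k in (0, 1)) for l in (0, 1)]
    return complete + miss_t2 + miss_t1


def reduced_model(theta):
    """Five parameters: p00, p01, p10 and constant phi, psi."""
    phi, psi = theta[3], theta[4]
    return cell_probs(theta[:3], [[phi, phi], [phi, phi]],
                      [[psi, psi], [psi, psi]])


def general_model(theta):
    """Eleven parameters: p00, p01, p10 and cell-specific phi, psi."""
    return cell_probs(theta[:3], [theta[3:5], theta[5:7]],
                      [theta[7:9], theta[9:11]])


def jacobian(f, theta, h=1e-6):
    """Central-difference derivative matrix; rows are cells, columns parameters."""
    cols = []
    for t in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[t] += h
        dn[t] -= h
        cols.append([(u - d) / (2 * h) for u, d in zip(f(up), f(dn))])
    return [list(row) for row in zip(*cols)]


def matrix_rank(rows, tol=1e-6):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


theta_reduced = [0.4, 0.1, 0.2, 0.1, 0.2]
theta_general = [0.4, 0.1, 0.2, 0.1, 0.12, 0.08, 0.1, 0.2, 0.18, 0.22, 0.2]

rank_reduced = matrix_rank(jacobian(reduced_model, theta_reduced))
rank_general = matrix_rank(jacobian(general_model, theta_general))
print(rank_reduced, len(theta_reduced))
print(rank_general, len(theta_general))
```

In this toy setting the five-parameter model has a derivative matrix of full column rank, whereas the eleven-parameter model cannot, since only eight cells are observed. This mirrors, on a small scale, the problem of 42 parameters and 32 observed cell counts described above.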

In a dual frame survey, we consider two types of models
for the missing data. In a Type (1) model, the probabilities
of missing time-1 or time-2 information for cell $(k, l)$ are the
same for each domain within a frame, i.e.,
$\phi^A_{kla} = \phi^A_{klab} = \phi^A_{kl}$, $\psi^A_{kla} = \psi^A_{klab} = \psi^A_{kl}$,
$\phi^B_{klb} = \phi^B_{klab} = \phi^B_{kl}$ and $\psi^B_{klb} = \psi^B_{klab} = \psi^B_{kl}$.
In this type of model, we estimate the $\phi$’s
and $\psi$’s separately from each sample. It might be
considered when the samples from the two frames are collected
using different modes. For example, if the frame $A$ sample
is a mail survey and the frame $B$ sample is a cell phone
survey, one might expect different probabilities of dropout
from the two samples.

In a Type (2) model, the probabilities of having missing 
data are the same in each domain, i.¢., bjia, = Vian = Onan 
This type of model might be considered when nonresponse 
is expected to be related to the cell membership, and frame 
membership is thought to have little effect on nonresponse.
For example, if the two surveys have similar types of
designs and administrative procedures, a Type (2) model
might be appropriate.

For each type of model, we may need to place additional
restrictions on the parameters in order to solve the likelihood
equations. Following Stasny and Fienberg (1986) the
following are possible restrictions:

Model 1: $\phi_{kl} = \lambda_{1(l)}$, $\psi_{kl} = \lambda_{2(k)}$ (3)
Model 2: $\phi_{kl} = \lambda_1$, $\psi_{kl} = \lambda_2$
Model 3: $\phi_{kl} = \lambda_{(l)}$, $\psi_{kl} = \lambda_{(k)}$
Model 4: $\phi_{kl} = \lambda_{1(l)}$, $\psi_{kl} = \lambda_2$
Model 5: $\phi_{kl} = \lambda_1$, $\psi_{kl} = \lambda_{2(k)}$

Under model 1, the probability that an individual is a
nonrespondent in a given time period depends on the given
time period and the individual’s classification in the
observed time period. Under model 2, the probability that an
individual is a nonrespondent in a given time period
depends only on the given time period. Under model 3, the
probability that an individual is a nonrespondent in a given
time period depends only on the individual’s classification
in the observed time period. Under model 4, the probability
that an individual is a nonrespondent at time 1 depends on
that time period and the individual’s classification in the
observed month, and the probability that an individual is a
nonrespondent at time 2 depends only on the time period 2.
Under model 5, the probability that an individual is a
nonrespondent at time 1 depends only on the time period,
and the probability that an individual is a nonrespondent at
time 2 depends on the time period and the individual’s
classification in the observed month. Many other models are
possible in addition to these five models for each type.
Using the derivative matrices, it is easily shown that
Models 1-5 are all identifiable.
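
One way to read the verbal descriptions of the five restrictions is as rules for filling the $2 \times 2$ tables of missingness probabilities from a smaller set of parameters. The sketch below encodes that reading; it is an interpretation, not the authors' code. The dictionary keys `"time1"`, `"time2"` and `"state"` are hypothetical names, and the sketch assumes that for a unit missing at time 1 the observed classification is its time-2 state $l$, and for a unit missing at time 2 it is its time-1 state $k$.

```python
def missingness_tables(model, lam):
    """Expand a reduced model into (phi, psi), 2 x 2 lists indexed [k][l].

    phi[k][l] is the probability of being missing at time 1 and psi[k][l]
    the probability of being missing at time 2 for a unit in cell (k, l).
    """
    if model == 1:      # depends on time period and observed state
        phi = lambda k, l: lam["time1"][l]
        psi = lambda k, l: lam["time2"][k]
    elif model == 2:    # depends on time period only
        phi = lambda k, l: lam["time1"]
        psi = lambda k, l: lam["time2"]
    elif model == 3:    # depends on observed state only
        phi = lambda k, l: lam["state"][l]
        psi = lambda k, l: lam["state"][k]
    elif model == 4:    # time 1: period and state; time 2: period only
        phi = lambda k, l: lam["time1"][l]
        psi = lambda k, l: lam["time2"]
    elif model == 5:    # time 1: period only; time 2: period and state
        phi = lambda k, l: lam["time1"]
        psi = lambda k, l: lam["time2"][k]
    else:
        raise ValueError("model must be 1, 2, 3, 4 or 5")
    states = (0, 1)
    return ([[phi(k, l) for l in states] for k in states],
            [[psi(k, l) for l in states] for k in states])


# Hypothetical parameter values.
phi1, psi1 = missingness_tables(1, {"time1": [0.10, 0.20], "time2": [0.05, 0.15]})
phi2, psi2 = missingness_tables(2, {"time1": 0.10, "time2": 0.05})
print(phi1, psi1)
print(phi2, psi2)
```

Under this reading, the five models use 4, 2, 2, 3 and 3 missingness parameters, respectively, per frame (Type (1)) or overall (Type (2)).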

In general, we will not have closed form solutions for the 
parameter estimates and the parameters must be estimated 
using an iterative method. We use the function ‘nlm’ in R
(www.r-project.org) to calculate parameter estimates; the
code is available from the authors. 


3.3. Estimators from complex samples 


When either or both samples are collected with a com- 
plex design, using the cell counts directly in the likelihood 
in (2) will give estimators that are not design-consistent. 
Skinner and Rao (1996) used a pseudo-maximum likelihood 
(PML) method to obtain design-consistent estimators in 
cross-sectional dual frame surveys. They showed that, 
unlike the estimators of Hartley (1962) and Fuller and
Burmeister (1972), the PML estimators for different
response variables used the same set of modified weights and
thus were internally consistent.

We propose to study estimators inspired by the PML 
method for gross flows in dual frame longitudinal complex 
surveys that allow for missing data at either time period in 
either sample. The basic idea is to use a working assumption 
of a multinomial distribution from a finite population to give 
the form of the estimators and use a design effect to adjust 
the cell counts to reflect the complex survey design. 

In the simple random sampling case, $x^A_{kld}/n_A$ is a
design-consistent estimator of $Q^A_{kld}$. To obtain a pseudo-likelihood
for general sampling designs, we replace $x^A_{kld}/n_A$ by
$\hat{X}^A_{kld}/N_A$, a design-consistent estimator of $Q^A_{kld}$ under the
complex sampling design, in the likelihood (2). Define
$\tilde{x}^A_{kld} = \tilde{n}_A \hat{X}^A_{kld}/N_A$ and $\tilde{x}^B_{kld} = \tilde{n}_B \hat{X}^B_{kld}/N_B$, where,
following Skinner and Rao (1996), we allow $\tilde{n}_A$ and $\tilde{n}_B$ to be
arbitrary constants. Note that if $N_A$ or $N_B$ is unknown, it
may be estimated by $\hat{N}_A$ or $\hat{N}_B$ instead.

The pseudo-likelihood has the same form as (2), with
$x^A_{kld}$ and $x^B_{kld}$ replaced by $\tilde{x}^A_{kld}$ and $\tilde{x}^B_{kld}$,
respectively. Iterative procedures are then used to find the
pseudo-MLEs of the quantities of interest $p_{kld}$, $\phi$, $\psi$ and
$N_{ab}$. By the fact that the pseudo-likelihood factors, $\hat{N}_{ab,PML}$ is
found to be the smaller of the roots of

$$[\tilde{n}_A + \tilde{n}_B] \hat{N}^2_{ab,PML} - [\tilde{n}_A N_B + \tilde{n}_B N_A + \tilde{n}_A \hat{N}^A_{ab} + \tilde{n}_B \hat{N}^B_{ab}] \hat{N}_{ab,PML} + [\tilde{n}_A \hat{N}^A_{ab} N_B + \tilde{n}_B \hat{N}^B_{ab} N_A] = 0. \qquad (4)$$


In a complex survey, particularly when clustering is
involved, the actual sample sizes $n_A$ and $n_B$ do not
necessarily reflect the relative amounts of information from
the samples. We thus suggest taking $\tilde{n}_A$ and $\tilde{n}_B$ to be the
effective sample size for each sample, with $\tilde{n}_A = n_A/$
(design effect of $S_A$) and $\tilde{n}_B = n_B/$ (design effect of $S_B$).
The design effect of an estimator $\hat{\mu}$ is the ratio

[$V(\hat{\mu})$ from complex survey design] / [$V(\hat{\mu})$ from SRS of same size].

The design effect is usually different for different
variables. For estimating gross flows, however, the only
estimators used from the component surveys are estimated
cell counts, and we might expect that in many surveys the
design effects for the estimators $\hat{X}^A_{kld}$ would all be similar,
and would also be similar to the design effect of the
estimator $\hat{N}^A_{ab}$. We thus, as in Skinner and Rao (1996),
suggest using the design effect for the estimator $\hat{N}^A_{ab}$ in
determining $\tilde{n}_A$, and the design effect for the estimator $\hat{N}^B_{ab}$
in determining $\tilde{n}_B$. If the design effects of the other
variables are indeed identical, then the resulting PMLEs will
minimize the variances of the estimated quantities; if they
differ, the PMLEs will not be optimal but they will be
consistent and in most situations will be close to the optimal
values (Lohr and Rao 2006). If the design effect for $\hat{N}^A_{ab}$ is
unavailable, as would occur, for example, if the survey were
poststratified to $N_{ab}$, then we suggest using a generalized
design effect, computed by taking an average or weighted
average of design effects from other variables in the survey.
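
The effect of the design-effect adjustment on the overlap estimate can be sketched as follows: each nominal sample size is divided by its design effect, and the resulting effective sample sizes are used as the constants in the quadratic (4). The function names and all numerical inputs are hypothetical; the sketch assumes that the frame sizes are known and that estimates of the overlap size are available from both samples.

```python
import math


def effective_sample_size(n, design_effect):
    """Nominal sample size divided by the design effect."""
    return n / design_effect


def pml_overlap_size(n_eff_A, n_eff_B, N_A, N_B, N_ab_hat_A, N_ab_hat_B):
    """Smaller root of the quadratic (4) in N_ab."""
    a = n_eff_A + n_eff_B
    b = -(n_eff_A * N_B + n_eff_B * N_A
          + n_eff_A * N_ab_hat_A + n_eff_B * N_ab_hat_B)
    c = n_eff_A * N_ab_hat_A * N_B + n_eff_B * N_ab_hat_B * N_A
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)


# Hypothetical inputs: the frame-A sample is clustered (design effect 2),
# while the frame-B sample behaves like a simple random sample.
N_A, N_B = 1000, 500
n_eff_A = effective_sample_size(100, 2.0)
n_eff_B = effective_sample_size(50, 1.0)

# The two samples give slightly different estimates of the overlap size.
N_ab_pml = pml_overlap_size(n_eff_A, n_eff_B, N_A, N_B, 210.0, 190.0)

# When both samples agree, the common value is returned.
N_ab_same = pml_overlap_size(n_eff_A, n_eff_B, N_A, N_B, 200.0, 200.0)
print(N_ab_pml, N_ab_same)
```

The pseudo-MLE is a compromise between the two overlap estimates, and a sample with a larger design effect contributes a smaller effective sample size to that compromise.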


4. Properties of the estimators 


In this section, we will investigate properties of the 
estimators. We derive asymptotic variances, discuss jack- 
knife variance estimators, and perform a small simulation 
study to explore the properties. 


4.1 Properties 


We consider the general case in which stratified
multistage samples are taken from each frame. The estimators of
population totals are the standard Horvitz-Thompson or
Hájek estimators from the surveys. From frame $A$, the
parameter vector $\eta_A = [(Q^A)', N_{ab}/N_A]'$ is estimated by
$\hat{\eta}_A = [(\hat{Q}^A)', \hat{N}^A_{ab}/N_A]'$, where $\hat{Q}^A_{kld} = \hat{X}^A_{kld}/N_A$;
$\eta_B = [(Q^B)', N_{ab}/N_B]'$ is estimated by
$\hat{\eta}_B = [(\hat{Q}^B)', \hat{N}^B_{ab}/N_B]'$ with $\hat{Q}^B_{kld} = \hat{X}^B_{kld}/N_B$.

Theorem 1: Let $\hat{\eta} = (\hat{\eta}_A', \hat{\eta}_B')'$ and $\eta = (\eta_A', \eta_B')'$. Assume
that the regularity conditions on the inclusion probabilities
in Isaki and Fuller (1982) hold for each sample. Let $\tilde{n}_A$ and
$\tilde{n}_B$ be the number of primary sampling units in frames $A$
and $B$, respectively, and let $\tilde{n} = \tilde{n}_A + \tilde{n}_B$. Assume that $\tilde{n}_A$
and $\tilde{n}_B$ both increase such that $\tilde{n}_A/\tilde{n}_B \to \gamma$ for some
$0 < \gamma < 1$. Then $\hat{\eta}$ is consistent for $\eta$, and

$$\tilde{n}^{1/2}(\hat{\eta} - \eta) \xrightarrow{d} N(0, \Sigma), \qquad (5)$$

where $\Sigma$ is a block-diagonal matrix with blocks $\Sigma_A$ and
$\Sigma_B$, $\Sigma_A$ is the asymptotic covariance matrix of $\tilde{n}^{1/2}\hat{\eta}_A$, and
$\Sigma_B$ is the asymptotic covariance matrix of $\tilde{n}^{1/2}\hat{\eta}_B$. If, in
addition, it is assumed that $N_{ab}/N \to \kappa$ for some
$0 < \kappa < 1$ and that the model is identifiable, then $\hat{\theta}$ is
consistent for $\theta$, where $\theta$, the parameter of interest,
consists of components of $p$, $N_{ab}/N$, $\phi$ and $\psi$, and $\hat{\theta}$ is
the pseudo-maximum likelihood estimator of $\theta$.
Furthermore, $\tilde{n}^{1/2}(\hat{\theta} - \theta)$ is asymptotically normal with mean 0
and asymptotic variance $H_A \Sigma_A H_A' + H_B \Sigma_B H_B'$, where
$H_F$ is the derivative matrix of the function $\theta$ with respect
to the parameters $\eta_F$ for frames $F \in \{A, B\}$.


Proof. With gross flows, observed values of all variables are
0 or 1. Thus the boundedness conditions in Lemmas 1 and 2
of Isaki and Fuller (1982) are met, and the estimators of
frame A are consistent and asymptotically normal with

$$\tilde n_A^{1/2}(\hat\eta_A - \eta_A) \xrightarrow{d} N[0, (\gamma/(1+\gamma))\,\Sigma_A].$$

The same argument applies to give consistency and
asymptotic normality for the vector of estimators from
frame B, with

$$\tilde n_B^{1/2}(\hat\eta_B - \eta_B) \xrightarrow{d} N[0, (1 - (\gamma/(1+\gamma)))\,\Sigma_B].$$

Combining these two asymptotic results, and using the
independence of the sampling designs along with Slutsky's
theorem, gives (5). The limiting distribution of
$\tilde n^{1/2}(\hat\theta - \theta)$
follows by the delta method, since the parameters in $\theta$ are
all twice continuously differentiable functions of those in
$\eta$. Since the parameter estimators cannot always be defined
explicitly as a function of other statistics from the sample,
we may derive the matrices $H_A$ and $H_B$ by linearizing
the score equations (Binder 1983). The assumption that
$N_{ab}/N \to \kappa \in (0,1)$ guarantees that the linearization is
well-defined.

Theorem 1 shows that linearization can be used to
estimate the variances of parameters of interest. In many
situations, however, the matrices $H_A$ and $H_B$ are
high-dimensional and the linearized variance estimators have
complex form. A practical way to estimate the variances of
the estimators is to use the jackknife estimator proposed by
Lohr and Rao (2000). Under the regularity conditions in
their Theorem 4, the jackknife and linearization variance
estimators are asymptotically equivalent. The form of the
jackknife variance estimator is
$v_J(\hat\theta) = v_A(\hat\theta) + v_B(\hat\theta)$,
where $v_A$ is a jackknife estimator obtained by deleting one
primary sampling unit at a time from frame A while using
the full data set for frame B, and $v_B$ is a jackknife
estimator obtained by deleting one primary sampling unit at
a time from frame B while using the full data set for
frame A.
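
The two-part structure of this jackknife can be sketched in a few
lines. The sketch below is only an illustration under simplifying
assumptions: the designs are treated as unstratified, each frame's
psus are held in a Python list, and `estimate` is a hypothetical
user-supplied function returning a scalar estimate from the two
lists. The multiplier (n - 1)/n is the usual unstratified
delete-one factor and is assumed here; it is not taken from Lohr
and Rao (2000), whose estimator handles stratified multistage
designs.

```python
def delete_one_jackknife(psus, other, estimate, psus_first=True):
    """Delete-one-psu jackknife variance for one frame.

    psus       -- list of psus from the frame being jackknifed
    other      -- full list of psus from the other frame (never deleted)
    estimate   -- hypothetical function estimate(psus_a, psus_b) -> float
    psus_first -- True if `psus` is frame A (first argument of estimate)
    """
    n = len(psus)
    full = estimate(psus, other) if psus_first else estimate(other, psus)
    total = 0.0
    for i in range(n):
        reduced = psus[:i] + psus[i + 1:]
        if psus_first:
            rep = estimate(reduced, other)
        else:
            rep = estimate(other, reduced)
        total += (rep - full) ** 2
    # Unstratified delete-one multiplier (n - 1) / n; an assumption here.
    return (n - 1) / n * total


def dual_frame_jackknife(psus_a, psus_b, estimate):
    """v_J = v_A + v_B, each part deleting psus from one frame only."""
    v_a = delete_one_jackknife(psus_a, psus_b, estimate, psus_first=True)
    v_b = delete_one_jackknife(psus_b, psus_a, estimate, psus_first=False)
    return v_a + v_b


def pooled_mean(psus_a, psus_b):
    """Toy estimator: average of the two frame-level means of psu means."""
    mean_a = sum(sum(p) / len(p) for p in psus_a) / len(psus_a)
    mean_b = sum(sum(p) / len(p) for p in psus_b) / len(psus_b)
    return 0.5 * (mean_a + mean_b)
```

Because the two samples are independent, the two pieces are simply
added; any estimator that can be recomputed on a reduced sample can
be passed in place of the toy `pooled_mean`.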


4.2 Simulation study 


Theorem 1 shows that the dual frame estimators are
consistent for the corresponding population quantities under
the modeled missing data mechanism. We performed a
small simulation study to investigate properties for moderate
sample sizes with overlapping frames. We generated the
data following the simulation study in Skinner and Rao
(1996), with $\gamma_a = N_a/N$ and $\gamma_b = N_b/N$. A cluster
sample from frame A was generated with $\tilde n_A$ psus and $m$
observations in each psu, and a simple random sample of
$n_B$ observations was generated for frame B. We generated
the clustered binary responses for the sample from frame A
by generating correlated multivariate normal random
vectors and then using the probit function to convert the
continuous responses to binary responses.
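
The latent-normal construction for clustered binary data can be
illustrated as follows. This is a minimal sketch, not the
simulation code used in the study: it assumes a single binary
response per unit, an exchangeable within-psu correlation `rho` on
the latent scale, and a common success probability `p`; the
function name and arguments are hypothetical.

```python
import math
import random
from statistics import NormalDist


def clustered_binary_sample(n_psu, m, p, rho, rng):
    """Generate n_psu clusters of m correlated binary responses.

    Latent variable: z_ij = sqrt(rho) * u_i + sqrt(1 - rho) * e_ij, with
    u_i and e_ij independent N(0, 1), so z_ij is N(0, 1) and two units in
    the same psu have latent correlation rho.  The response is 1 when
    z_ij falls below the standard normal quantile of p, so P(y = 1) = p.
    """
    threshold = NormalDist().inv_cdf(p)
    sample = []
    for _ in range(n_psu):
        u = rng.gauss(0.0, 1.0)  # shared psu effect
        psu = []
        for _ in range(m):
            e = rng.gauss(0.0, 1.0)  # unit-level noise
            z = math.sqrt(rho) * u + math.sqrt(1.0 - rho) * e
            psu.append(1 if z < threshold else 0)
        sample.append(psu)
    return sample


rng = random.Random(2024)
sample = clustered_binary_sample(n_psu=100, m=5, p=0.4, rho=0.3, rng=rng)
```

Thresholding a shared-effect normal variable keeps the marginal
probability at `p` while inducing positive correlation among
responses in the same psu.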

After generating the sample, we calculated the estimators
of the probabilities of the union of frame A and frame B, the
average of the absolute value of the bias and the empirical mean
squared error (EMSE) under different settings. The EMSE
of a given estimator $\hat Y$ is calculated as:

$$\mathrm{EMSE} = \frac{1}{R}\sum_{r=1}^{R}(\hat Y_r - Y)^2, \qquad (6)$$

where $\hat Y_r$ is the value of $\hat Y$ for the $r$th simulation run. In
our simulation study, we used $R = 100$.
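
As a small illustration of the EMSE calculation, the following
sketch averages squared deviations of the simulated estimates from
the true value. The function name and the three example estimates
are hypothetical and are not results from the study.

```python
def emse(estimates, true_value):
    """Empirical mean squared error over R simulation runs."""
    r = len(estimates)
    return sum((est - true_value) ** 2 for est in estimates) / r


# Three made-up estimates of a true proportion of 0.30.
example = emse([0.31, 0.29, 0.33], 0.30)
```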

The simulation study was performed with factors: (1)
$\gamma_a$: 0.2 or 0.4, (2) $\gamma_b$: 0.2 or 0.4, (3) clustering
parameter $\rho$: 0.3, (4) missing data mechanism: the probability that an
individual is a nonrespondent in a given month depends on
the time period and the individual's classification in the
observed period; or missing completely at random, (5)
amount of missing data: close to 10% or close to 20%, (6)
sample sizes: $\tilde n_A$: 10, 100 or 500; $m$: 5; $n_B$: 100, 1,000 or
5,000. All runs used probability parameters $p_a$: (0.3,
0.1, 0.2, 0.4), $p_{ab}$: (0.3, 0.1, 0.1, 0.5), and $p_b$: (0.4, 0.1,
0.1, 0.4). Table 3 shows the results of the simulation study
with missing data generated under Model 1 and fitted with
both Model 1 and the model using complete records only.


Table 3

Results from the simulation study for missing data generated
under Model 1. Case (1) fits the correct model: Model 1; Case
(2) uses complete records only. Bias is the average absolute
bias for the population gross flow proportions $p_{kl}$; EMSE is
the average empirical mean squared error for the $p_{kl}$; the
proportions used to generate the missing data are
$\lambda_{(t-1)0} = 0.141$, $\lambda_{(t-1)1} = 0.070$,
$\lambda_{t0} = 0.137$ and $\lambda_{t1} = 0.068$. Here,
$\tilde n_A$ is the number of psus in sample A with psu size 5 and $n_B$
is the number of elements in sample B


$\tilde n_A$ = 10, $n_B$ = 100
                       $p_{00}$          $p_{01}$          $p_{10}$          $p_{11}$
Case 1    Estimator    0.311          0.120          0.149          0.420
          Bias         0.040          0.029          0.029          0.040
          EMSE         0.002          0.001          0.001          0.002
                       $\lambda_{(t-1)0}$   $\lambda_{(t-1)1}$   $\lambda_{t0}$       $\lambda_{t1}$
          Estimator    0.159          0.095          0.146          0.094
          EMSE         0.001          0.001          0.002          0.001
Case 2    Estimator    0.286          0.120          0.146          0.448
          Bias         0.048          0.029          0.029          0.041
          EMSE         0.004          0.001          0.001          0.002

$\tilde n_A$ = 100, $n_B$ = 1,000
                       $p_{00}$          $p_{01}$          $p_{10}$          $p_{11}$
Case 1    Estimator    0.321          0.092          0.138          0.449
          Bias         0.015          0.011          0.009          0.015
          EMSE         3.337e-04      1.798e-04      1.418e-04      3.256e-04
                       $\lambda_{(t-1)0}$   $\lambda_{(t-1)1}$   $\lambda_{t0}$       $\lambda_{t1}$
          Estimator    0.145          0.074          0.123          0.068
          EMSE         2.642e-04      9.389e-05      3.917e-04      8.206e-05
Case 2    Estimator    0.293          0.092          0.135          0.480
          Bias         0.028          0.011          0.010          0.040
          EMSE         0.001          1.839e-04      1.711e-04      0.002

$\tilde n_A$ = 500, $n_B$ = 5,000
                       $p_{00}$          $p_{01}$          $p_{10}$          $p_{11}$
Case 1    Estimator    0.321          0.093          0.135          0.452
          Bias         0.006          0.008          0.007          0.012
          EMSE         4.960e-05      7.162e-05      6.381e-05      1.857e-04
                       $\lambda_{(t-1)0}$   $\lambda_{(t-1)1}$   $\lambda_{t0}$       $\lambda_{t1}$
          Estimator    0.140          0.071          0.125          0.064
          EMSE         4.466e-05      1.818e-05      2.288e-04      3.545e-05
Case 2    Estimator    0.292          0.092          0.132          0.483
          Bias         0.028          0.008          0.008          0.043
          EMSE         8.265e-04      7.642e-05      9.571e-05      1.906e-03

When data are missing at random, all models give
estimators of the gross flow proportions $p_{kl}$ that are
approximately unbiased, so we do not report the results here.
From Table 3, both the correct model and the analysis of
complete records only produce biased estimators of the
$p_{kl}$'s. With larger sample sizes, however, the bias persists
in the analysis that uses complete records only, while it
diminishes when Model 1 is fit. This example has relatively
small probabilities of missing data. With larger amounts of
missing data, the contrast between the estimators is more
pronounced.


5. Application 


In this section, we apply our results to data from the 
Survey of Income and Program Participation (SIPP) and the 
Current Population Survey (CPS) within Arizona. Both CPS 
and SIPP are longitudinal stratified multistage panel 
surveys. We treat SIPP and CPS as a dual frame survey with 
the same target population: the Arizona population 18 years 
old to 64 years old. Using information from both surveys, 
we want to model the transition probabilities of employment 
status changes from January 2001 to January 2002 of people 
between 18 years old and 64 years old. Note that, strictly 
speaking, these two surveys are not designed as a dual frame 
survey. They use different questions for the labor force 
variables. Although we recoded the variables according to 
the labor force definitions in CPS, it is possible that these 
different question wordings and orderings produce bias 
when combining the information. We use this as an example
because real longitudinal dual frame data are not available.
Nevertheless, the example shows the potential gains in 
efficiency by combining the information from two surveys 
in estimating gross flows. 

The target population of both surveys is the noninstitutionalized
civilian population of the United States. We
consider a subset of the population: the population in the
labor force from 18 years old to 64 years old. So
$N_A = N_B = N_{ab}$, and the estimation problem is a special case of
the theory given in Section 3. The longitudinal file for the
2001 and 2002 SIPP (Westat 2001) uses one panel. We
merged Wave 1 (where January 2001 records are stored),
Wave 4 (where January 2002 records are stored) and the
longitudinal weight file, in which the weights are adjusted to
sum to the population count. Since the longitudinal panel
weights have been adjusted for nonresponse, we
consider this as a no missing data case. The resulting
weighted gross flow table from SIPP is given in Table 4.

For the CPS, the rotation group design introduces
partially classified data. January 2001 and January 2002
have 50 percent of the sample in common. We use these
50% of the data together with the partially classified data to
perform the analysis. The weight variable we use is a
cross-sectional weight with cross-sectional nonresponse and
calibration adjustments (United States Census Bureau
2006). For individuals present in the survey for only one of
the years, we use the weight from that year. For persons
present in both January 2001 and January 2002, we use the average
of the two weights. We chose the average of
the two weights to minimize the variance of the
composite estimator. The population group we used is the
18-64 age group, and we excluded persons who were not in
that category during both years. The weighted gross flow
table from CPS is in Table 5.


Table 4
Gross flow table for SIPP, in Arizona

                               January 2002
                               Employed      Unemployed
January 2001    Employed       2,491,029     73,204
                Unemployed     30,698        30,160
Total                                        2,625,091


Table 5
Gross flow table for CPS, in Arizona

                               January 2002
                               Employed      Unemployed    Missing
January 2001    Employed       1,129,656     38,848        689,497
                Unemployed     41,586        8,211         36,041
                Missing        606,549       57,549
Total                                                      2,607,937


Since SIPP is considered as a no missing data case, we 
assumed ,, = YW, =0 and use a Type | model in the data 
analysis. We adjusted each weight in the CPS data by the 
factor 2,625,091/2,607,937 to reach a single population total 
between the two time periods and a single population total 
between the two surveys. The number of observations in
SIPP (frame A) after combining January 2001 and January
2002 is 551 and the design effect for unemployment is
about 1.76, so $\tilde n_A = 551/1.76 \approx 313$. The design effect for
unemployment in CPS (frame B) is about 1.229, so
$\tilde n_B = 1{,}020/1.229 \approx 830$. Because the likelihood factors, the
estimated probabilities from the five models in
(3) are all the same. We list the estimated probabilities and
the standard errors from SIPP, CPS and the data combining
these two surveys in Table 6.


Table 6

Estimated transition probabilities using SIPP, CPS, and the
dual frame method with SIPP and CPS. Standard errors are
given in parentheses

                 $p_{00}$       $p_{01}$       $p_{10}$       $p_{11}$
SIPP             0.9489      0.0279      0.0117      0.0115
                 (0.0124)    (0.0093)    (0.0061)    (0.0060)
CPS              0.9088      0.0454      0.0353      0.0106
                 (0.0100)    (0.0072)    (0.0064)    (0.0035)
SIPP and CPS     0.9230      0.0381      0.0262      0.0127
                 (0.0080)    (0.0058)    (0.0050)    (0.0030)
Due to confidentiality issues, no clustering information is
available in the CPS public-use data sets. We used a product
of the published design effect and the variance from
multinomial sampling to estimate the variances from both
SIPP and CPS data. The result from Theorem 1 was applied
to estimate the variances of $\hat p_{kl}$ for $k, l = 0, 1$. In this
special situation, the variance estimate from the combination
of the two data sets reduces to
$(\tilde n_A/(\tilde n_A + \tilde n_B))^2 V_A + (\tilde n_B/(\tilde n_A + \tilde n_B))^2 V_B$,
where $V_A$ denotes the variance
estimate from SIPP data and $V_B$ denotes the variance
estimate from CPS data. Table 6 shows that the standard
errors are reduced by using the dual frame method.
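
The arithmetic behind the combined standard error can be checked
with a short calculation. The sketch below uses the sample sizes
and design effects quoted in the text and the $p_{00}$ standard
errors from Table 6; it assumes that the weights enter squared, as
they do for the variance of a weighted sum of two independent
estimators. The variable names are illustrative only.

```python
import math

# Figures quoted in the text and in Table 6 (p_00 column).
n_obs_sipp, deff_sipp = 551, 1.76
n_obs_cps, deff_cps = 1020, 1.229
se_sipp, se_cps = 0.0124, 0.0100

# Effective sample sizes: observations divided by the design effect.
n_eff_a = n_obs_sipp / deff_sipp
n_eff_b = n_obs_cps / deff_cps

# Weights proportional to effective sample size.
w_a = n_eff_a / (n_eff_a + n_eff_b)
w_b = n_eff_b / (n_eff_a + n_eff_b)

# Variance of the weighted combination of two independent estimators.
var_combined = w_a ** 2 * se_sipp ** 2 + w_b ** 2 * se_cps ** 2
se_combined = math.sqrt(var_combined)
print(round(n_eff_a), round(n_eff_b), round(se_combined, 4))
```

Under these assumptions the combined standard error comes out at
roughly 0.008, in line with the "SIPP and CPS" entry for $p_{00}$ in
Table 6.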

We also performed goodness-of-fit tests, developed in Lu
(2007), for the five models in (3). The parameter estimates
from the five models and results from the goodness-of-fit
tests are listed in Table 7. All five models fit the data well,
so we recommend adopting the simplest model, Model 3,
for the data.

Table 7 
Estimated parameters and results of goodness of fit tests 


Estimated Parameters    df    Corrected $G^2$    p-value


Modell Agay Araay Aro) Aray 3 3.03 0.39 
0.246 0.395 0.277 0.302 

Model2 4, ry 5 8.58 0.12 
0.255 0.278 

Model3 Ag i 5 6.61 0.25 
0.262 0.353 

Model 4. Ayavoy Avagy Ae 4 4.10 0.39 
0.246 0.397 0.278 

Model5 A,» Ang)  Anaay 4 6.74 0.15 
0.255 0.277 0.313 


With the limited information available on the public-use 
data sets, we used simple weight adjustments to make the 
estimated population counts consistent with known totals. 
The SIPP and CPS weights in the data sets have already 
been calibrated and adjusted for nonresponse, so that the 
models for missing data mostly reflect the rotating panel 
design rather than attrition due to moving and other 
activities that might be related to employment status. 

Future research on these models might include using
different weighting adjustments for the longitudinal surveys.
In addition, different parameters could be used to distinguish
observations that have partial classification because of
the rotating panel design from observations that have partial
classification because of nonresponse. To do so, we could
introduce a Markov chain model similar to the one
proposed by Stasny (1987). In the complete data model,
individuals are allocated to the table according to a single
multinomial distribution. At the second step of the process,
which is also unobserved, each individual may be chosen to
either rotate out of the sample after the interview for month
$t - 1$ or rotate into the sample before the month $t$ interview
according to the sampling plan. Finally, in the third step of
the process, each remaining individual may either lose its
row classification or lose its column classification for other
reasons. Using this model, we can model the nonresponse
at both times (i.e., losing both the row and the column
classifications).


6. Conclusions 


In this article, we developed statistical methods for 
estimating gross flows from dual frame surveys. These 
methods are necessary to estimate changes in poverty status 
or employment status over time. We developed pseudo- 
maximum likelihood estimators that use the dual frame 
structure and the properties of the two survey designs. Our 
models also account for effects of missing data when an 
individual drops out of the survey or when a rotation panel 
design is used, so they allow full use of partial information 
that may be provided by some households. We use a 
jackknife method to estimate the variance of estimators and 
examine the properties of the estimators. The results have 
been applied to real datasets. 

In this paper, the categories of the gross flow tables are 
defined independently from the sample outcomes. It is also 
possible to define the categories based on values that depend 
on the sample. For example, in social surveys, the poverty 
line might be defined using a percentile from the sample and 
the categories defined as “Below the poverty line” and 
“Above the poverty line.” Methods from this paper can be 
used to estimate gross flows if the category definitions 
depend on the sample, but the variance estimators need to 
account for the effect of estimating the category boundaries. 

Although the results in this paper are for dual frame 
surveys, the methods are general and could be extended to 
more than two surveys using PML estimators developed in 
Lohr and Rao (2006). As the number of frames increases, 
however, so does the complexity of possible missing data 
mechanisms. Misclassification error may also be more 
prevalent with a larger number of frames. 

Our research is done in the context of survey sampling, 
but it also applies to other settings in which data could be 
combined from two independent sources. As it becomes 
increasingly difficult for a single survey to cover the entire 
population of interest, we believe these methods for 
estimating gross flows can provide better coverage of the 
population with less expense. They also allow for 
supplementing a general population survey with surveys of 
specific subpopulations of interest.


Bayesian penalized spline model-based inference for finite population
proportion in unequal probability sampling


Qixuan Chen, Michael R. Elliott and Roderick J.A. Little ¹


Abstract 


We propose a Bayesian Penalized Spline Predictive (BPSP) estimator for a finite population proportion in an unequal 
probability sampling setting. This new method allows the probabilities of inclusion to be directly incorporated into the 
estimation of a population proportion, using a probit regression of the binary outcome on the penalized spline of the 
inclusion probabilities. The posterior predictive distribution of the population proportion is obtained using Gibbs sampling. 
The advantages of the BPSP estimator over the Hajek (HK), Generalized Regression (GR), and parametric model-based 
prediction estimators are demonstrated by simulation studies and a real example in tax auditing. Simulation studies show 
that the BPSP estimator is more efficient, and its 95% credible interval provides better confidence coverage with shorter 
average width than the HK and GR estimators, especially when the population proportion is close to zero or one or when the 
sample is small. Compared to linear model-based predictive estimators, the BPSP estimators are robust to model
misspecification and influential observations in the sample.


Key Words: Bayesian analysis; Binary data; Penalized spline regression; Probability proportional to size; Survey
samples.


1. Introduction 


Unequal probability sampling designs are commonly 
employed in data collection by science and government. 
Perhaps the simplest unequal probability design is stratified 
sampling, which samples units from different strata with 
different inclusion probabilities. Another important form of 
unequal probability sampling is probability-proportional-to- 
size (pps) sampling, in which the inclusion probability is 
proportional to the value of a size variable measured for all 
population units. 

An unequal probability sampling design such as pps 
sampling is often used for efficient estimation of population 
means of continuous variables, for which the variance 
increases with size of unit. However, inferences about 
discrete variables are often also of interest in a multipurpose 
survey (e.g., Lehtonen and Veijanen 1998, Lehtonen, 
Sarndal and Veijanen 2005). In this paper, we focus on 
methods of inference for finite population proportions from 
unequal probability sampling designs, based on an auxiliary 
variable measured for all the units in the population. We use 
pps sampling as a specific design to illustrate and assess our 
methods. 

The inclusion probabilities play important and somewhat 
different roles in design-based and model-based inference 
from unequal probability survey samples (Smith 1976, 1994; 
Kish 1995; Little 2004). In design-based inference, survey 
variables are fixed, and inference is based on the distribution 
of the sample inclusion indicators; the standard design-based 
approaches to estimation such as the Horvitz-Thompson
(HT) estimator (1952) and its extensions weight sampled
units by the inverse of their inclusion probabilities. These 
estimators are design consistent (Isaki and Fuller 1982) and 
provide reliable inferences in large samples without the need 
for modeling assumptions. However, these estimators are 
potentially very inefficient, as illustrated in Basu’s (1971) 
famous elephant example. Also, variance estimation is cum- 
bersome because it requires second-order inclusion proba- 
bilities. Corresponding confidence intervals are based on 
asymptotic theory, and may deviate from nominal levels for 
moderate or small sample sizes. 

Model-based inference predicts values of survey vari- 
ables in the non-sampled units by including the inclusion 
probabilities as covariates in the prediction model (Little 
2004). Model-based prediction estimators are consistent and 
efficient under the assumed model, but are subject to bias 
when the underlying model is misspecified. This limitation 
motivates the development of flexible statistical models that 
are more robust to model misspecification. For continuous 
survey data, Zheng and Little (2003) estimated the finite 
population total using a nonparametric regression on a 
penalized spline (p-spline) of the inclusion probabilities. We 
propose here Bayesian P-Spline Predictive (BPSP) esti- 
mators that are suitable for a binary, as opposed to contin- 
uous, outcome. We adopt a Bayesian approach to inference 
for this model, since Bayesian methods often yield better 
inference for small sample problems, and are conveniently 
implemented for our proposed model via the Gibbs
sampler. In this approach, auxiliary variables other than the
inclusion probability can also be included in the model, but
the inclusion probability is singled out since modeling of
this variable is prone to model misspecification.

We compare the performance of BPSP estimators with 
Hajek (HK, Horvitz-Thompson-type) estimators and with 
Generalized Regression (GR) estimators for a binary out- 
come proposed by Lehtonen and Veijanen (1998). The GR 
approach is a popular model-assisted modification of the 
design-based estimators that combines predictions from a 
model with design-weighted model residuals (Montanari 
1998), to yield estimates that are approximately design 
unbiased. 

Zheng and Little (2003; 2005) compared HT, p-spline 
prediction, and GR estimates of the total of a continuous 
survey variable by simulation. They found that p-spline
model-based estimators had better root mean squared error
than the other methods, and that jackknife standard errors
provided superior confidence coverage to HT or GR
inferences. We conduct similar comparisons for inference
about a population proportion for a binary outcome, and 
show similar advantages for our BPSP estimator over the 
HK and GR alternatives. 


2. Design-based estimator 


Suppose that we have a finite population consisting of $N$
identifiable units. Let $Y$ be the binary survey variable of
interest and $p = N^{-1}\sum_{i=1}^{N} Y_i$ be the proportion of the
population for which $Y = 1$. Let $\pi_i$ denote the probability
of inclusion for unit $i$, which is assumed to be known for all
units in the finite population before a sample is drawn. An
unequal probability random sample $s$ with elements
$y_1, \ldots, y_n$ is then drawn from the finite population according
to the inclusion probabilities $\pi_1, \ldots, \pi_N$. The design-based
HK estimator in the discussion of Basu (1971) is defined as

$$\hat p_{HK} = \frac{\sum_{i \in s} y_i/\pi_i}{\sum_{i \in s} 1/\pi_i}. \qquad (1)$$

The variance for $\hat p_{HK}$ can be estimated via linearization of
the Yates-Grundy estimator (1953) of totals,

$$\hat V(\hat p_{HK}) = \Big(\sum_{k \in s} 1/\pi_k\Big)^{-2} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \Big(\frac{y_i - \hat p_{HK}}{\pi_i} - \frac{y_j - \hat p_{HK}}{\pi_j}\Big)^2. \qquad (2)$$

The Yates-Grundy variance estimator requires pairwise
inclusion probabilities. When the pairwise inclusion
probabilities are not available, as in our simulations, the
approximate formula proposed by Hartley and Rao (1962),

$$\pi_{ij} \approx \frac{n-1}{n}\pi_i\pi_j + \frac{n-1}{n^2}(\pi_i^2\pi_j + \pi_i\pi_j^2) - \frac{n-1}{n^3}\pi_i\pi_j\sum_{k=1}^{N}\pi_k^2,$$

has frequently been used. An approximate $1 - \alpha$ level
confidence interval for the population proportion $p$ is
then obtained based on the normal approximation.

Statistics Canada, Catalogue No. 12-001-X
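
The design-based calculations in this section can be sketched as
follows. This is an illustrative implementation under stated
assumptions rather than the code used in the simulations: the
Hartley-Rao expression is the commonly quoted three-term
approximation, the variance function follows the linearized
Yates-Grundy form, and all function and argument names are
hypothetical.

```python
def hajek_proportion(y, pi):
    """Hajek estimator: weighted mean with weights 1 / pi_i."""
    num = sum(yi / p for yi, p in zip(y, pi))
    den = sum(1.0 / p for p in pi)
    return num / den


def hartley_rao_pairwise(pi_i, pi_j, n, sum_pi_sq):
    """Three-term Hartley-Rao approximation to the pairwise inclusion
    probability; sum_pi_sq is the sum of pi_k^2 over the population."""
    c = (n - 1) / n
    return (c * pi_i * pi_j
            + c / n * (pi_i ** 2 * pi_j + pi_i * pi_j ** 2)
            - c / n ** 2 * pi_i * pi_j * sum_pi_sq)


def hajek_variance(y, pi, pairwise):
    """Linearized Yates-Grundy variance estimate for the Hajek estimator.

    pairwise -- function (pi_i, pi_j) -> pairwise inclusion probability
    """
    n = len(y)
    p_hat = hajek_proportion(y, pi)
    n_hat = sum(1.0 / p for p in pi)  # estimated population size
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            pij = pairwise(pi[i], pi[j])
            zi = (y[i] - p_hat) / pi[i]
            zj = (y[j] - p_hat) / pi[j]
            total += (pi[i] * pi[j] - pij) / pij * (zi - zj) ** 2
    return total / n_hat ** 2
```

Passing the pairwise probabilities as a function keeps the variance
routine independent of how they are obtained, so exact values can
be substituted when the design provides them.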


3. Bayesian P-Spline Predictive (BPSP) estimator 


Royall (1970) argued for the use of models for
finite-population descriptive inferences by predicting the
unobserved values based on models, since model-based
inferences should be more efficient than design-based
inferences. To model the relationship between the binary
outcome $Y$ and the continuous inclusion probability $\pi$, we
need to fit a binary regression of $Y$ on $\pi$. Parametric binary
regressions, such as the linear or quadratic logistic or probit
model, may not fit the data adequately. One remedy
for this inflexibility is to fit a binary regression
on a spline of $\pi$ by adding some knots. However, too many
knots may make the fitted curve rough. One way to
overcome this problem is to retain all of the knots but to
constrain their influence, by fitting a binary p-spline
regression model.

Common methods for modeling a binary outcome are 
logistic and probit regressions, and they generally give 
similar results. We choose to adopt probit models in our 
study for computational convenience. The probit regression 
model for binary outcomes has an underlying truncated 
normal regression structure on latent continuous data. If the 
latent continuous data are known, the parameters in binary 
p-spline regression models can be estimated using standard 
approaches for normal p-spline regression models. In a 
Bayesian context, the posterior distribution of parameters in 
the probit p-spline model can be computed using Gibbs 
sampling (Albert and Chib 1993; Ruppert, Wand and 
Carroll 2003, chapter 16). In contrast, the logistic p-spline 
regression model requires a more complicated computation 
procedure such as the Metropolis-Hastings algorithm. The 
computational advantage makes the probit link function 
more desirable than the logit link function in Bayesian 
binary p-spline regression models. 

There are various types of p-splines. When applying
p-splines, we need to choose the degree, the knot
locations, and the basis functions used to represent the model.
We choose to use the truncated polynomial p-splines
because they are simple and intuitive. More numerically
stable estimators can be obtained using B-splines via
orthogonalizing the truncated power bases (Eilers and Marx
1996). The probit truncated polynomial p-spline regression
model has a generalized linear mixed model representation,

$$\Phi^{-1}(E(y_i \mid \beta, b, \pi_i)) = \beta_0 + \sum_{k=1}^{p} \beta_k \pi_i^k + \sum_{l=1}^{m} b_l (\pi_i - k_l)_+^p, \qquad (3)$$

$$b_l \sim N(0, \tau^2),$$

where $\Phi^{-1}(\cdot)$ denotes the inverse CDF of a standard normal
distribution, and the constants $k_1 < \ldots < k_m$ are $m$ selected
fixed knots. A function such as $(\pi_i - k_l)_+^p$ is called a
truncated polynomial spline basis function with power $p$,
where $(u)_+^p$ is equal to $\{u \times I(u \ge 0)\}^p$ for any real
number $u$. Since the truncated polynomial spline basis
function has $p - 1$ continuous derivatives, higher values of
$p$ lead to smoother spline functions. By specifying a
normal distribution for $b_l$, the influence of the $m$ knots is
constrained in Model (3), which is equivalent to smoothing the
splines via the penalized likelihood.
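
To make the basis concrete, the following sketch builds one row of
the design matrix implied by the truncated polynomial
representation. The helper names are hypothetical, and the knot
rule (equally spaced sample quantiles of the inclusion
probabilities) is an assumption made for illustration; the text
above does not specify how the knots are placed.

```python
def truncated_power(u, p):
    """(u)_+^p = {u * I(u >= 0)}^p."""
    return float(u) ** p if u >= 0 else 0.0


def pspline_design_row(pi_i, knots, p=1):
    """Row of the design matrix: fixed part [1, pi, ..., pi^p]
    followed by one truncated power term per knot."""
    fixed = [float(pi_i) ** k for k in range(p + 1)]
    random_part = [truncated_power(pi_i - k, p) for k in knots]
    return fixed + random_part


def quantile_knots(pis, m):
    """Assumed knot rule: m knots at equally spaced sample quantiles."""
    ordered = sorted(pis)
    n = len(ordered)
    return [ordered[(j * n) // (m + 1)] for j in range(1, m + 1)]


# Example: linear spline (p = 1) with three knots.
row = pspline_design_row(0.5, [0.2, 0.4, 0.6], p=1)
```

The first p + 1 columns carry the fixed polynomial coefficients and
the remaining m columns carry the coefficients $b_l$, which is what
allows the model to be fitted as a mixed model.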

The parameters in Model (3) can be estimated using
generalized linear mixed model methods. An alternative
Bayesian approach that simplifies computation is to assume
weak prior and hyperprior distributions and use Gibbs
sampling to obtain draws from the posterior distributions of
the parameters as follows: the probit regression model for
binary responses has an underlying normal regression
structure on latent continuous data; if the latent data are
known, the posterior distribution of the parameters can be
computed using standard results for normal regression
models; and given the posterior distribution of the
parameters, the latent continuous data can be simulated from a
suitable truncated normal distribution (Ruppert et al. 2003,
page 290). The detailed algorithm of Gibbs sampling is in
the Appendix. In addition, Bayesian inference for
p-spline regression can also be implemented using
WinBUGS, the standard Bayesian analysis software
(Crainiceanu, Ruppert and Wand 2005).

The posterior distribution of the population proportion is
simulated by generating a large number $D$ of draws and
using the predictive estimator form
$\hat p^{(d)} = N^{-1}(\sum_{i \in s} y_i + \sum_{j \notin s} \hat y_j^{(d)})$,
where $\hat y_j^{(d)}$ is a draw from the posterior
predictive distribution of the $j$th non-sampled unit of the
binary outcome. The average of these draws simulates the
Bayesian P-Spline Predictive (BPSP) estimator of the finite
population proportion, and is denoted as $\hat p_{BPSP}$, where

$$\hat p_{BPSP} = D^{-1} \sum_{d=1}^{D} \hat p^{(d)}. \qquad (4)$$

The Bayesian analog of a $100 \times (1 - \alpha)\%$ confidence
interval for the population proportion is a $100 \times (1 - \alpha)\%$
credible interval, which can be formed in a number of
different ways. We split the tail area $\alpha$ equally between the
upper and lower endpoints in the simulations.
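
Given the posterior predictive draws, the point estimate and the
equal-tailed interval reduce to simple summaries. The sketch below
assumes the draws of the population proportion have already been
produced by a sampler; the function names and the empirical
quantile rule (floor and ceiling of the index) are illustrative
assumptions, not the authors' implementation.

```python
import math


def predictive_draw(y_sampled, y_nonsampled_draw, N):
    """One draw of the population proportion: observed y's plus one
    posterior predictive draw for every non-sampled unit, over N."""
    return (sum(y_sampled) + sum(y_nonsampled_draw)) / N


def bpsp_summary(p_draws, alpha=0.05):
    """Posterior mean and equal-tailed 100(1 - alpha)% credible
    interval from D draws of the population proportion."""
    d = len(p_draws)
    ordered = sorted(p_draws)
    estimate = sum(ordered) / d
    # Simple empirical quantiles; the indexing rule is an assumption.
    lo_idx = math.floor((alpha / 2) * (d - 1))
    hi_idx = math.ceil((1 - alpha / 2) * (d - 1))
    return estimate, (ordered[lo_idx], ordered[hi_idx])
```

Splitting the tail area equally is only one way to form the
interval; a highest posterior density interval would use the same
draws but a different selection rule.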

Firth and Bennett (1998) showed that any parametric 
logistic regression model containing an intercept term and 
the inverse of inclusion probabilities as a covariate, fitted by 
ordinary, unweighted maximum likelihood, was “internally 
bias calibrated” (IBC) for population proportions, and thus 
yields design consistency. This property is also true for 
logistic truncated polynomial p-spline regression models on 
the inverse of inclusion probabilities, fitted via penalized 
likelihood. With the probit link function used instead of the 
logit link function and fitted via Markov chain Monte Carlo 
algorithm instead of maximum penalized likelihood, the 
BPSP estimator may no longer have the IBC property. 
However, the similarity between the probit model and the 
logistic model implies that the predictive estimator based on 
a probit p-spline regression model is approximately design- 
consistent. We believe that obtaining efficient estimates 
with close to nominal confidence coverage in finite samples 
is more important than exact design consistency. 


4. Generalized Regression (GR) estimator 


For the estimation of class frequencies of a discrete
response variable, Lehtonen and Veijanen (1998) proposed
a GR estimator $\hat t_{GR}$ of the total, which combines the
predicted values $\hat y_i = \Pr(Y_i = 1 \mid \pi_i)$ based on a suitable
model and the HT estimator for the residuals $r_i = y_i - \hat y_i$
of the sampled units,

$$\hat t_{GR} = \sum_{i=1}^{N} \hat y_i + \sum_{i \in s} r_i/\pi_i. \qquad (5)$$

The GR estimator in Equation (5) is then used in constructing
an estimator for population proportions by dividing by the
known population size $N$ (Duchesne 2003),

$$\hat p_{GR.1} = \frac{1}{N}\Big\{\sum_{i=1}^{N} \hat y_i + \sum_{i \in s} r_i/\pi_i\Big\}. \qquad (6)$$

We also consider here another version of the GR
estimator for the estimation of finite population proportions,
in which the denominator of the bias calibration term for the
residuals $r_i$ is the estimated population size $\sum_{i \in s} 1/\pi_i$,

$$\hat p_{GR.2} = \frac{1}{N}\sum_{i=1}^{N} \hat y_i + \Big(\sum_{i \in s} r_i/\pi_i\Big)\Big(\sum_{i \in s} 1/\pi_i\Big)^{-1}. \qquad (7)$$
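
The difference between the two GR versions is only in the divisor
of the residual term, which the following sketch makes explicit. It
assumes that model predictions are available for every population
unit; the function and argument names are hypothetical.

```python
def gr_estimators(y_s, pi_s, yhat_s, yhat_all):
    """Two GR estimators of a population proportion.

    y_s, pi_s, yhat_s -- outcomes, inclusion probabilities and model
                         predictions for the sampled units
    yhat_all          -- model predictions for all N population units
    """
    N = len(yhat_all)
    mean_pred = sum(yhat_all) / N
    resid_total = sum((y - yh) / p for y, yh, p in zip(y_s, yhat_s, pi_s))
    n_hat = sum(1.0 / p for p in pi_s)  # estimated population size
    p_gr1 = mean_pred + resid_total / N      # residuals divided by N
    p_gr2 = mean_pred + resid_total / n_hat  # residuals divided by N-hat
    return p_gr1, p_gr2
```

When the inverse inclusion probabilities of the sample happen to
sum to N, the two estimators coincide.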


For the variance estimate of (6), we use the variance
estimator of the estimated total of a discrete response
variable, given by Lehtonen and Veijanen (1998), divided
by $N^2$. For the variance estimate of (7), we apply the
Taylor linearization technique (Sarndal, Swensson and
Wretman 1992, page 182). These two variance estimators
are shown in equations (8) and (9), respectively.

$$\hat V(\hat p_{GR.1}) = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}} \frac{r_k}{\pi_k} \frac{r_l}{\pi_l}, \qquad (8)$$

$$\hat V(\hat p_{GR.2}) = \Big(\sum_{i \in s} 1/\pi_i\Big)^{-2} \sum_{k \in s} \sum_{l \in s} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}} \frac{e_k}{\pi_k} \frac{e_l}{\pi_l}, \qquad (9)$$

where $e_k = r_k - (\sum_{i \in s} r_i/\pi_i)(\sum_{i \in s} 1/\pi_i)^{-1}$. These variance
estimators also require pairwise inclusion probabilities,
which can be approximated by the method of Hartley and
Rao (1962).

However, the Hartley and Rao approximation may lead
to bias in the variance estimator. Thus, we also consider the
jackknife method for variance estimation (Shao and Wu
1989). The sample is stratified into $n/G$ strata each of size
$G$ with similar values of inclusion probabilities, and the $G$
subgroups are then constructed by selecting one element at a
time from each stratum without replacement (Zheng and
Little 2005). Let $\hat p_{(g)}$ be the same GR estimators in (6) and
(7) calculated from the reduced sample without the elements
in the $g$th subgroup, and let $\bar p$ be the average of the $G$
estimators based on the $G$ reduced samples. The jackknife
variance estimator of $\hat p_{GR}$ is

$$\hat V_{jackknife}(\hat p_{GR}) = \frac{G-1}{G} \sum_{g=1}^{G} (\hat p_{(g)} - \bar p)^2. \qquad (10)$$
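
The subgroup construction and the delete-a-group jackknife can be
sketched as follows. This is an illustration under stated
assumptions: the sample size is taken to be a multiple of G, the
multiplier (G - 1)/G is the standard delete-a-group factor, and
`estimator` is a hypothetical callback that recomputes the point
estimate from the retained sample indices.

```python
import random


def make_subgroups(pi, G, rng):
    """Split sample indices into G subgroups: sort by inclusion
    probability, cut into strata of size G, and deal one unit from
    each stratum to every subgroup at random."""
    order = sorted(range(len(pi)), key=lambda i: pi[i])
    groups = [[] for _ in range(G)]
    for start in range(0, len(order), G):
        stratum = order[start:start + G]
        rng.shuffle(stratum)
        for g, idx in enumerate(stratum):
            groups[g].append(idx)
    return groups


def grouped_jackknife_variance(n, groups, estimator):
    """Delete-a-group jackknife; `estimator` maps a list of retained
    sample indices to a point estimate."""
    G = len(groups)
    reps = []
    for g in range(G):
        dropped = set(groups[g])
        kept = [i for i in range(n) if i not in dropped]
        reps.append(estimator(kept))
    p_bar = sum(reps) / G
    return (G - 1) / G * sum((p - p_bar) ** 2 for p in reps)
```

Dealing one unit per stratum into each subgroup keeps the
distribution of inclusion probabilities similar across the reduced
samples.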


A design-weighted logistic regression model on other
covariates was used as the assisting model to predict $\hat y_i$ in
the GR estimators for binary outcomes (Lehtonen and
Veijanen 1998; Lehtonen et al. 2005). Since our interest
here is in comparisons of GR estimators with the BPSP
estimator, we apply the estimators (6) and (7) with linear
probit regression models and probit p-spline models, as
described in detail in Section 5. For the GR estimator using
a linear probit model as the assisting model, we use the
inclusion probability as a covariate as well as a weight in
our simulations.


5. Simulation study 


5.1 Design of the simulation study 


Simulation studies are conducted to study the perfor- 
mance of the BPSP estimator compared with the HK 
estimator, the GR estimators, and the linear model-based 
predictive estimators for a variety of populations in pps 
sampling. We present the simulation results for the
following six estimators:

a) HK, the Hájek estimator defined by equation (1).

b) LR, the predictive estimator of the form
$\hat{p}_{LR} = N^{-1}(\sum_{i \in s} y_i + \sum_{i \notin s} \hat{y}_i^{LR})$, with predictions $\hat{y}_i^{LR}$ obtained
as the maximum likelihood predictions from the
linear logistic regression model containing a constant
term and the reciprocal inclusion probability as the
covariate. LR has the IBC property, and hence is
design-consistent. LR is exactly the same as its GR
estimator in equation (6).

c) PR, the predictive estimator of the form
$\hat{p}_{PR} = N^{-1}(\sum_{i \in s} y_i + \sum_{i \notin s} \hat{y}_i^{PR})$, with predictions $\hat{y}_i^{PR}$ from the
Bayesian linear probit model containing an intercept
term and the inclusion probability as the covariate.

d) PR_GR, the GR estimator in equation (7), where $\hat{y}_i$
is the prediction for unit $i$ with unknown parameters
replaced by weighted maximum likelihood estimates
from the probit model with a constant term and the
inclusion probability as the covariate.

e) BPSP, the BPSP estimator defined by equation (4)
with $p = 1$ and an inverse-gamma prior distribution for
$\tau^2$, using 15 knots.

f) BPSP_GR, the GR estimator in equation (7), where
$\hat{y}_i$ is the posterior mean of $\Pr(Y_i = 1 \mid \pi_i)$ from the
BPSP model.


We only report the simulation results based on the linear 
splines for the BPSP estimator, since simulations not shown 
here suggest that linear splines perform as well as quadratic 
splines or cubic splines in all the simulation scenarios. We 
choose two fixed numbers of knots (15 or 30), and place 
knots at evenly spaced sample percentiles. These choices
work well, and 15 knots are enough to capture the curvature
in our simulations. In addition, the
GR estimators in (6) perform similarly to the estimators in 
(7); some differences between these estimators emerge in 
the real application in Section 6, leading us to prefer (7) 
over (6). 
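As an illustration of the knot placement described above, the following Python sketch places K knots at evenly spaced sample percentiles of the inclusion probabilities and builds one row of a linear spline design. The truncated-line basis $(x - \kappa_k)_+$ and both function names are assumptions made for this sketch; the basis used in Model (3) should be checked against its definition.

```python
def percentile_knots(x, K):
    """K knots at the k/(K+1) sample quantiles of x, k = 1, ..., K."""
    xs = sorted(x)
    n = len(xs)
    knots = []
    for k in range(1, K + 1):
        pos = k / (K + 1) * (n - 1)      # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        # Linear interpolation between neighbouring order statistics
        knots.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return knots

def linear_spline_row(x, knots):
    """Design row [1, x, (x - k_1)_+, ..., (x - k_K)_+] for a linear spline."""
    return [1.0, x] + [max(x - k, 0.0) for k in knots]
```

For example, `percentile_knots(pi_sample, 15)` would give the 15-knot configuration, with the first two entries of each row playing the role of the fixed effects and the remaining entries the penalized spline terms.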

We simulated two artificial populations of size 2,000, 
using two different distributions, with sampling rates of 5% 
and 10%, where the size variable takes the consecutive 
integer values 71, 72, ..., 2,070. The inclusion probabilities 
in the population were then calculated as proportional to the 
size variable, with the maximum value about 30 times the
minimum value.

Continuous data $Z$ were first generated from normal
distributions with mean structure $f(\pi_i)$ and constant error
variance 0.04. Two different mean structures $f(\pi_i)$ were
simulated: a linearly increasing function (LINUP), $f(\pi_i) = k_1\pi_i$,
and an exponential function (EXP), $f(\pi_i) = \exp(-4.64 + k_2\pi_i)$.
To make the range of $Z$ similar
across different mean structures, $k_1$ takes values of 3 and 6,
and $k_2$ takes values of 26 and 52, when the sampling rate is
10% and 5%, respectively. Figure 1 plots the two
populations. We then generated the binary outcome variable
$Y_1$, where $Y_1$ is equal to one if $Z$ is less than or equal to its
superpopulation 10th percentile, and otherwise $Y_1$ is equal to
zero. Similarly, we generated the binary outcomes $Y_2$ and
$Y_3$ by using the superpopulation 50th and 90th percentiles of
$Z$ as cut-off values. The target of inference here is the
population proportion with $Y$ equal to one.
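The population set-up just described can be sketched in Python as follows. The function name and its arguments are hypothetical, and one simplification is labeled in the code: the empirical percentile of the generated $Z$ values is used in place of the superpopulation percentile, whose computation is not spelled out here.

```python
import math
import random

def simulate_population(n, mean_fn="LINUP", N=2000, seed=1):
    """Return (pi, z, y); y maps a cut-off quantile to a list of 0/1 outcomes."""
    rng = random.Random(seed)
    sizes = range(71, 71 + N)              # 71, 72, ..., 2,070 when N = 2,000
    total = sum(sizes)
    pi = [n * x / total for x in sizes]    # inclusion probabilities proportional to size
    ten_percent = n / N >= 0.10
    k1 = 3 if ten_percent else 6           # LINUP slope for 10% and 5% sampling
    k2 = 26 if ten_percent else 52         # EXP rate for 10% and 5% sampling
    if mean_fn == "LINUP":
        mean = [k1 * p for p in pi]
    else:
        mean = [math.exp(-4.64 + k2 * p) for p in pi]
    z = [rng.gauss(m, 0.2) for m in mean]  # error variance 0.04, i.e. sd 0.2
    z_sorted = sorted(z)
    y = {}
    for q in (0.10, 0.50, 0.90):
        # Assumption: empirical percentile of Z stands in for the
        # superpopulation percentile used in the paper
        cutoff = z_sorted[round(q * N) - 1]
        y[q] = [1 if v <= cutoff else 0 for v in z]
    return pi, z, y
```

With `n = 200` the inclusion probabilities sum to the sample size and the largest is about 29 times the smallest, in line with the ratio of about 30 quoted above.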

In each simulation replicate, a finite population was 
generated before a sample was drawn, and the true finite 
population proportion with Y equal to one was calculated 
and denoted as p. A pps sample was then drawn 
systematically from a randomly ordered list of the finite 
population. For each population and sample size 
combination, 1,000 replicates were obtained and the six 
estimators were compared in terms of empirical bias, root 
mean squared error (RMSE), and the non-coverage rate of 
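the 95% confidence/credible interval. Simulation results are
presented in Tables 1 through 3. Let $\hat{p}_i$ be an estimate of
$p_i$ based on the $i$th pps sample; the empirical bias and
RMSE are defined as follows:

$$\text{Bias} = \frac{1}{1{,}000}\sum_{i=1}^{1{,}000}(\hat{p}_i - p_i),$$

$$\text{RMSE} = \sqrt{\frac{1}{1{,}000}\sum_{i=1}^{1{,}000}(\hat{p}_i - p_i)^2}.$$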
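One replicate of the sampling step and the two summary measures can be sketched in Python as below. The function names are hypothetical, every inclusion probability is assumed to be below 1, and the Hájek-type ratio is included only as a simple example of an estimator to evaluate.

```python
import math
import random

def systematic_pps(pi, rng):
    """Systematic pps sample from a randomly ordered list (all pi[i] < 1)."""
    order = list(range(len(pi)))
    rng.shuffle(order)            # random ordering of the population list
    target = rng.random()         # random start in [0, 1)
    cum, sample = 0.0, []
    for i in order:
        cum += pi[i]
        if cum > target:          # unit i covers the next selection point
            sample.append(i)
            target += 1.0
    return sample

def hajek(y, pi):
    # Weighted mean with weights 1/pi
    return sum(v / p for v, p in zip(y, pi)) / sum(1.0 / p for p in pi)

def empirical_bias_rmse(estimates, truths):
    """Empirical bias and RMSE over replicates, as defined above."""
    errors = [e - t for e, t in zip(estimates, truths)]
    r = len(errors)
    bias = sum(errors) / r
    rmse = math.sqrt(sum(d * d for d in errors) / r)
    return bias, rmse
```

In a full study the first function would be called once per replicate on a freshly generated population, and the last function once per estimator over the 1,000 replicates.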


5.2 Simulation results 


Figure 2 shows the posterior means of $\Pr(Y_i = 1 \mid \pi_i)$
and 95% credible intervals based on the Bayesian probit
linear p-spline model for a random pps sample from the
EXP case. The upper left plot is the scatter plot of the
continuous variable $Z$ in a pps sample, with three
horizontal parallel lines superimposed, representing the
superpopulation 10th, 50th, and 90th percentiles, respectively.
In the upper right plot, the values of the binary variable $Y$, defined as 1 if
$Z$ is less than or equal to the superpopulation 10th
percentile, are plotted with black circles, and the
superpopulation $\Pr(Y_i = 1 \mid \pi_i)$ is plotted with a solid
black curve. The solid grey curve and two dashed grey
curves are the posterior means of $\Pr(Y_i = 1 \mid \pi_i)$ and 95%
credible intervals based on the Bayesian probit linear
p-spline regression model. The other two plots are similar to
the upper right plot, but with the superpopulation 50th and 90th
percentiles as cut-off values in defining $Y$. These plots
show that the true probabilities of $Y = 1$ fall within the
95% credible intervals, and are close to the posterior means
of $\Pr(Y_i = 1 \mid \pi_i)$. We conclude that the Bayesian probit
p-spline regression model fits the binary outcomes well in
the nonlinear case.

Table 1 shows the empirical bias ($\times 10^3$) for the six
estimators in the two populations generated via LINUP and
EXP. Overall the design-based estimators (a, d, and f) are
less biased than the model-based estimators (b, c, and e). In
the LINUP case, the linear probit regression model is
correctly specified, so that the empirical bias of the PR
estimator is similar to the empirical bias of the BPSP
estimator; while in the EXP case, a nonlinear probit
regression is needed to fit the data, and thus the PR
estimator is more biased than the BPSP estimator when the
true population proportions are 0.1 and 0.5. However, the
LR estimator has empirical bias similar to that of the BPSP
estimator because of the IBC property. Compared to the
model-based PR and BPSP estimators, the PR_GR and BPSP_GR
estimators reduce the bias by adding the bias calibration
term. Moreover, regardless of which assisting model is
used, both GR estimators achieve similar empirical bias.


Figure 1 Two simulated artificial populations, LINUP and EXP ($N$ = 2,000); X-axis: inclusion probability, Y-axis: value of $Z$

Figure 2 A random pps sample from the EXP case ($n$ = 200, $N$ = 2,000): (a) scatter plot of $Z$; the three
grey lines are the superpopulation 10th, 50th, and 90th percentiles, respectively. (b) black
circles are observed units of binary survey variable $Y$ in the sample, defined as $Y = I(Z \le$
10th percentile); the grey solid and dashed curves are posterior means of $\Pr(Y_i = 1 \mid \pi_i)$ and
95% credible intervals, respectively, simulated based on a probit p-spline model on $\pi_i$; and
the black curve is the superpopulation $\Pr(Y_i = 1 \mid \pi_i)$. (c) similar to (b), but with $Y = I(Z \le$ 50th
percentile). (d) similar to (b), but with $Y = I(Z \le$ 90th percentile)


Table 1 
Empirical bias × 1,000 of six estimators (minimum absolute bias within a row is in italic print)
Population n True prop. HK LR PR PR_GR BPSP BPSP_GR 
LINUP 100 0.10 -0.01 13.0 10.3 1.6 8.0 1.2 
0.50 -4.0 -2.9 -4.3 -3.0 -5.2 -3.3 
0.90 -0.4 0.3 -2.5 0.3 -2.9 0.08 
200 0.10 DES TSS) 5.8 ES Sal 1.4 
0.50 3:3 -0.1 all 83 -0.06 -1.7 -0.2 
0.90 1.6 0.4 -1.0 0.3 -1.2 0.3 
EXP 100 0.10 ez, 18.1 25.8 4.7 17.0 39 
0.50 -4.0 -3.5 ES -1.6 -1.4 -3.4 
0.90 -1.3 -0.2 -1.0 -0.1 -1.0 -0.2 
200 0.10 Ball 11.0 22.1 8),5) 13.4 De) 
0.50 3.8 -0.6 14.0 0.4 0.01 -0.7 
0.90 23 0.1 -0.7 0.1 -0.7 0.02 


Table 2 shows the empirical root mean squared error
($\times 10^3$) for the six estimators. The BPSP estimator has much
smaller empirical root mean squared error than the HK
estimator, except when $p$ is 0.1 in the EXP case. Overall
the PR estimator performs similarly to the BPSP estimator.
To protect against model misspecification, the GR estimators
lose some efficiency compared to their corresponding
model-based predictive estimators. The PR_GR estimator
has RMSE similar to that of the BPSP_GR estimator, but by
using assisting models both GR estimators have smaller
RMSE than the HK estimator.

Table 3 shows the noncoverage probability ($\times 10^2$) of
95% confidence/credible intervals, that is, the probability that the
true finite population proportion is outside the 95% CI of the
estimators. To calculate the variances of estimators, we use
the Yates-Grundy variance estimator as defined in equation
(2) for the HK estimator; the jackknife resampling method
defined by equation (10) for the LR estimator; and both
the linearization (V1) method defined by equation (9) and
the jackknife resampling (V2) method for the PR_GR and
BPSP_GR estimators. Overall, the coverage of the
credible interval for the BPSP estimator is closer to the
nominal level than that of the other five estimators, especially when
the population proportion $p$ is close to zero or one or when
few observations are selected into the sample in the tails.
Specifically, the BPSP estimator achieves a marked
improvement in coverage when $p$ is close to zero in both
the LINUP and EXP cases, since little data are included in
the sample from the lower tail of the two populations. Note
that the improved coverage of the BPSP estimator is
achieved with intervals that are narrower on average than
those of the HK, LR, PR_GR, and BPSP_GR estimators.
As with the empirical bias and RMSE, the BPSP_GR estimator
does not improve the coverage in comparison to the PR_GR
estimator by using a flexible assisting model.

The choice of prior and hyperprior distributions in mixed 
models can have a big effect on inferences. We used a prior 
distribution N(0,10°) for the fixed effects parameters, $\beta$.
In our simulations, we report results based on a proper
inverse-gamma prior distribution for $\tau^2$, namely
$\tau^2 \sim IG(0.1, 0.1)$. To assess sensitivity to the choice of
prior distributions, we also computed results using
$\tau^2 \sim IG(0.01, 0.01)$ and $\tau^2 \sim IG(0.001, 0.001)$, as well
as an improper uniform prior distribution on $\tau$ (Gelman
2006). These different priors had little impact on posterior 
inference of the proportion of interest. 


6. Example of tax auditing 


We now compare the BPSP estimator with alternative 
methods on a real population involving income tax auditing 
data (Compumine 2007). The data set consists of 3,119
Swedish income tax returns for persons who during the year
sold mutual funds managed in a foreign country. The
outcome of interest $Y$ is whether the income tax return is
incorrect (coded as 1 for incorrect, and 0 for correct), and it
is measured for all observations in this data set. We treated
the 3,119 income tax returns as a finite population here, so
that the true population proportion of incorrect income tax
returns is 0.517. Since the amount of the realized positive
profit is an important feature for determining the amount the
tax payer has hidden from taxation for his return of income
from the sale of a foreign fund, it was chosen as the size
variable used in drawing the pps samples. When the primary
measure of interest is the total amount the tax payer has
hidden from taxation, it is reasonable to assign a value of 1
Swedish Krona, the minimum amount of the positive profits,
to negative profits, since negative values are not allowed in
the size variable.

One thousand repeated systematic pps samples of size 
300 and 600 were drawn without replacement from 
randomly ordered population lists. The returns with largest 
profits were included with certainty into the samples of size 
300 and 600: there were 78 and 241 such returns respec- 
tively. Figure 3 shows that the probability of inclusion has a 
right-skewed distribution for the population even after 
excluding the observations with inclusion probability of 1. 

We applied the same six estimators as in the simulation 
study with 30 knots on the pps samples, and compared their 
performances in terms of empirical bias, RMSE, and 
average width and noncoverage rate of the 95% confidence/ 
credible interval. For the BPSP estimator, a fixed number of 
30 knots are placed at evenly spaced sample percentiles of 
the inclusion probabilities. For the GR estimators, neither
the linearization nor the jackknife variance estimator
performs clearly better than the other, so we present
the inference based on the linearization variance estimator
because it is simpler to compute. We report the GR estimators based
on both equations (6) and (7). The results are displayed in 
Table 4. 


Table 2 
Empirical RMSE × 1,000 of six estimators (minimum RMSE within a row is in italic print)
Population n True prop. HK LR PR PR_GR BPSP BPSP_GR 
LINUP 100 0.10 Soul SWoll 46.3 Syil2) 47.2 eviley 
0.50 65.2 50.8 47.1 49.7 47.7 50.0 
0.90 26.3 22.6 WBS) BAA IS) 22.9 
200 0.10 39.3 40.9 31.8 36.1 5220 36.2 
0.50 45.7 35.9 32.8 34.3 BAI. 34.6 
0.90 17.8 15.4 155) 15.4 [Ses Ss 
EXP 100 0.10 Sule. 60.1 54.4 51.6 Srila 52.4 
0.50 66.1 56.0 43.0 32 47.0 Sie 
0.90 24.2 12.4 5 12.4 IDES} 12.3 
200 0.10 35.9 42.4 39.6 35.6 36.0 36.2 
0.50 45.1 38.9 Biles 36.1 8251 Boul 
0.90 15.8 8.0 8.1 6.0 8.0 8.0


Table 3
Noncoverage rate of 95% CI × 100 of six estimators (noncoverage rate within a row closest to 5 is in italic print)
Population  n    True prop.  HK     LR     PR     PR_GR  PR_GR  BPSP   BPSP_GR  BPSP_GR
                                                  V1     V2            V1       V2
LINUP       100  0.10        16.2   18.0   5.4    20.9   16.1   9.0    18.4     14.2
                 0.50        Tes    9.4    5.0    ee     7.6    4.4    vie}     7.1
                 0.90        7.4    11.4   5)a//  8.0    9.4    5.4    8.4      7.1
            200  0.10        10.8   12.6   6.4    339    10.9   2      12.6     9.4
                 0.50        Se)    8.3    eS)    6.2    529    il     6.0      5)
                 0.90        6.0    8.4    4.4    6.1    4.4    4.7    6.3      35)
EXP         100  0.10        15.0   18.1   10.5   19.4   14.8   Oy     18.4     14.4
                 0.50        7.4    [Bt    ae     9.0    11.4   8.9    10.2     8.4
                 0.90        6.1    10.5   12)    99     7.6    7.0    9.8      Te
            200  0.10        10.8   Nhe    9.9    WES    led    Tea)   12.4     9.4
                 0.50        6.0    Wiles  14.3   EZ,    8.5    6.2    1:       6.9
                 0.90        ope)   8.8    5)     6.8    4.6    23)    6.6      i

* V1: variance estimator using linearization; V2: jackknife variance estimator.


Table 4 shows that the BPSP estimator has slightly
increased bias but smaller RMSE, shorter average interval width, and
credible interval coverage closer to the nominal level than the
design-based estimators (a), (d), and (f). Results not shown here
indicate that the BPSP estimator with a uniform prior 
distribution has slightly better performance than that with 
inverse-gamma prior distribution with respect to empirical 
bias, RMSE, and coverage rate, because there are more 
fluctuations in the data and the uniform prior allows the 
fitted function to have more flexibility. The BPSP_GR 
estimator is less biased, but achieves less efficiency and 
worse coverage rate than the BPSP estimator. The 
predictive estimator using the probit linear regression model
as prediction model performs poorly here since the model is
misspecified, but its GR estimator does reduce bias and
RMSE and improve coverage rate. The BPSP_GR estimator
based on equation (6) performs very poorly in terms of 
RMSE compared to the estimator in equation (7), because a 
situation similar to that in Basu’s (1971) circus elephant 
example occurs, where one or more observations having 
very low inclusion probabilities are selected into the sample 
and hence receive large weights. However, the PR_GR 
estimator in equation (6) performs as well as that in equation 
(7) with predictions obtained from the weighted maximum 
likelihood estimates, where inclusion probability is used as a 
covariate as well as the sample weights. Overall, the GR 
estimator in equation (7) is more desirable than that in 
equation (6). As the sample size increases from 300 to 600, 
the noncoverage probability of the 95% credible interval of 
the BPSP estimator approaches the nominal level of 5% 
quickly from 14% to 5%, but the coverages are consistently 
below the nominal level for the other estimators. 

Compared to the linear model-based predictive estimators,
the BPSP estimator is robust not only to model
misspecification, but also to influential observations in
the sample. To demonstrate the robustness to influential
observations, we compare the changes in the model fit
using probit p-spline models, the linear probit model, and
the quadratic probit model based on the pps sample only in
Figure 4, and based on the pps sample as well as the
observations with inclusion probabilities of 1 in Figure 5. In
each figure, the population is stratified by the 100 quantiles
of the probabilities of inclusion, and the true probabilities of
$Y = 1$ are calculated and plotted with a black dot for each
stratum. The grey curves are the posterior means of
$\Pr(Y_i = 1 \mid \pi_i)$ from 10 random pps samples using a
3,000-iteration Gibbs sampler and a linear spline in the left plot, using
linear probit regression in the middle plot, and using
quadratic probit regression in the right plot. Figure 4 shows
that the probit p-spline regression model is more flexible in
capturing the pattern among the observations than the
parametric models. From Figure 4 to Figure 5, the posterior
means of $\Pr(Y_i = 1 \mid \pi_i)$ do not change except for those
with very large inclusion probabilities using the p-spline
model. However, the posterior mean curves change
dramatically using the quadratic probit regression. These
comparisons indicate that the probit p-spline regression model
is less likely to be affected by influential observations, and hence
is a good choice of prediction model in model-based
inference.


7. Discussion 


Bayesian inferences based on the p-spline model 
outperform the HK estimator, the GR estimators, and linear 
model-based prediction estimators in our simulations. The 
BPSP estimators are more efficient than the HK and GR 
estimators, and despite slightly higher empirical bias, their 
95% credible intervals provide better confidence coverage 
and shorter average interval width, especially when the 
population proportion is closer to zero or one and few data 
are selected into the sample in the tails. This suggests the 
importance of current research in estimating finite
population prevalence of rare events.

The BPSP estimator is a natural extension of the regular
linear regression model-based estimators of finite population
proportions. Compared to linear model-based predictive
estimators, the BPSP estimator achieves robustness to
model misspecification and influential observations in the
sample by using a flexible p-spline model, without much
loss of efficiency for the sample sizes considered. Therefore,
the BPSP estimator is easy to understand, although it requires
complex computation. However, with the availability of
WinBUGS, the Bayesian statistical software, the BPSP
estimator can be easily implemented by survey practitioners.


Table 4
Comparison of various estimators for empirical bias, root mean squared error, and average width and noncoverage rate of 95%
CI, in the tax return example

Methods      bias × 100       RMSE × 100       average width × 100   noncoverage × 100
             300     600      300     600      300     600           300     600
HK           -2.4    -1.8     12.4    10.2     36      I}            14.1    NOL
LR           6.7     5,3)     lle     Oo       Ay      Dsl           43.5    45.6
PR           -11.6   -10.1    12.4    10.6     18      14            69.8    83.4
PR_GR_1      -1.2    -0.4     11.5    8.7      3       oD            22.4    16.8
PR_GR_2      -1.2    -0.3     ils)    8.8      83      26            16.1    11.4
BPSP         -6.8    -2.7     OS      oe       OF      19            14.2    5.0
BPSP_GR_1    -3.0    -0.5     102.6   56.9     77      57            14.4    9.2
BPSP_GR_2    -0.7    0.2      12.0    10.1     34      26            39)     12.8

* GR_1: GR estimators using equation (6);
GR_2: GR estimators using equation (7).


Figure 3 Box plots of the probabilities of inclusion for two sample sizes ($n$ = 300 and $n$ = 600) in the tax auditing example

Figure 4 Predictions based on pps samples only in the tax auditing example. X-axis: inclusion
probabilities $\pi$; Y-axis: $P(Y = 1 \mid \pi)$; black dots are the true $P(Y = 1 \mid \pi)$ within each percentile
of $\pi$; grey curves are ten realizations of the posterior means of $P(Y = 1 \mid \pi)$. The prediction
models are (a) probit linear p-spline regression, (b) linear probit regression, (c) quadratic
probit regression

Figure 5 Predictions based on the combined data of pps samples and the observations sampled with
certainty in the tax auditing example. X-axis: inclusion probabilities $\pi$; Y-axis: $P(Y = 1 \mid \pi)$;
black dots are the true $P(Y = 1 \mid \pi)$ within each percentile of $\pi$; grey curves are ten
realizations of the posterior mean of $P(Y = 1 \mid \pi)$. The prediction models are (a) probit
linear p-spline regression, (b) linear probit regression, (c) quadratic probit regression


The BPSP estimators are not sensitive to the two choices of
prior distribution for $\tau^2$ considered here, though it appears
from the tax auditing example that the uniform prior yields 
slightly smaller bias and RMSE, shorter 95% credible 
intervals, and better coverage when a nonlinear prediction 
model is needed. The tax auditing example also shows that 
in the GR estimator, an estimated population size using the 
sum of inverse inclusion probabilities is more desirable than 
the true population size when one or more observations with 
very low inclusion probability are included in the sample, 
since the GR estimator with denominator N has high 
variance and low efficiency in this case. 

The design-based estimators and their 95% confidence 
intervals can provide valid inferences for population propor- 
tions when the sample is large. However, these asymptotic 
properties do not appear to hold when the sample size is 
moderate or small. The BPSP approach can provide more 
valid inferences for small samples, especially when the true 
population proportion to be estimated is close to 0 or 1, 
although confidence coverage appears to be less than 
nominal when the sample size gets small, and lack of 
parsimony of the model is an issue. When estimating 
proportions away from the tails, the BPSP estimator leads to
slightly smaller RMSE and confidence coverage closer to the
nominal level than the HK and GR estimators, but
the improvement is not as pronounced as in the tails. In this
scenario, to avoid the complex computation of the BPSP
estimator, the PR_GR estimator based on equation (7) is an
alternative for survey practitioners.

The choice of variance estimator is problematic for some 
unequal probability designs for the design-based estimators, 
but the Bayesian p-spline prediction approach provides a 
simulation approximation of the full posterior distribution of
the population proportion. Extra work is not needed to
estimate the variance or 95% credible interval for the BPSP 
estimator, as it can be obtained simultaneously with the 
point estimators. In Zheng and Little (2005), three variance 
estimators of the p-spline model-based estimator for finite 
population total in a pps sample were compared, including 
the model-based empirical Bayes variance estimator, the 
jackknife variance estimate, and the balanced repeated 
replication (BRR) variance estimate. The simulation studies 
showed that the jackknife method worked well, whereas the 
BRR method tended to yield conservative standard errors 
and the model-based empirical Bayes estimator was 
vulnerable to misspecification of the variance structure. In 
the present work, the $1 - \alpha$ level credible interval for the
BPSP estimator of the population proportion is constructed by
splitting $\alpha$ equally between the upper and lower endpoints
of the posterior distribution of $p$. This purely Bayesian
approach based on draws from the posterior distribution
seems to work well in our setting and avoids the heavy
computation associated with the jackknife and BRR
methods.
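The equal-tailed interval described above amounts to reading off two quantiles of the posterior draws. A minimal Python sketch follows, with a hypothetical function name and a linear-interpolation quantile rule chosen for this illustration.

```python
def equal_tailed_interval(draws, alpha=0.05):
    """Equal-tailed (1 - alpha) credible interval from posterior draws."""
    xs = sorted(draws)

    def quantile(q):
        pos = q * (len(xs) - 1)          # fractional index into sorted draws
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    # alpha is split equally between the lower and upper tails
    return quantile(alpha / 2), quantile(1 - alpha / 2)
```

Applied to the Gibbs draws of the finite population proportion, this gives the interval at no extra cost beyond sorting the draws.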

The BPSP estimator we propose here can be extended to 
include additional auxiliary covariates by adding linear 
terms for these variables. For domain estimation, an 
interaction term between the spline of inclusion probabilities 
and the domain indicator should also be modeled. Both the 
additive effects of auxiliary variables and the interaction 
between the domain indicator and inclusion probabilities 
can be represented in a mixed model (Ruppert et al. 2003,
page 231) and estimated using Gibbs sampling or 
WinBUGS (Crainiceanu et al. 2005). The BPSP estimator 
for finite population proportions can also be extended to a 
more general case of a polychotomous response. The Gibbs
sampling approach for the binary case can be generalized to
the case of ordered categories, and can be applied to
unordered categories with a latent multinomial distribution
(Albert and Chib 1993). Another extension of the BPSP
estimator is to small area estimation, by combining small
area random effects with the smooth spline on the inclusion
probabilities (Opsomer, Claeskens, Ranalli, Kauermann and 
Breidt 2008). This extension will be the focus of future 
research. 

Finally, one reviewer questioned whether the proposed 
approach can be applied in a multipurpose survey with 
many outcomes, since the modeling procedure does not 
provide a single set of weights and needs to be repeated for 
all variables of interest. It is true that our methods are more 
computationally intensive than existing approaches, but the 
BPSP method can be easily implemented with a Gibbs 
sampling algorithm or using WinBUGS, so computing is 
not a major obstacle. We point out that the simulations in 
the paper involved repeating the iterative Gibbs analysis 
6,000 times, so an equivalent level of computation on a 
single survey of comparable size would allow the imple- 
mentation of the BPSP method for 6,000 outcomes! These 
were done on a garden-variety laptop PC. While we do not 
advocate automatic use of any analytical method, design or 
model-based, our point is that computational complexity is 
no longer a major obstacle to applying these methods. We 
suggest that the statistical properties of a method are more 
important than computing time, given modern-day
computing resources.


Appendix


Algorithm of Gibbs sampling


Model (3) can also be written in the matrix form,

The algorithm of Gibbs sampling for estimating the
parameters in Model (3) is as follows:


a) The probit regression model for the binary outcome
$y = [y_1, \ldots, y_n]^T$ corresponds to a normal regression
model for latent continuous data $y^* = [y_1^*, \ldots, y_n^*]^T$,
which have a truncated multivariate normal
distribution with mean $(X\beta + Zb)$ and identity
covariance matrix (Albert and Chib 1993), and $y_i$ is
the indicator that $y_i^* > 0$. With some initial values of
$(\beta, b)$, values of the latent continuous data $y_i^*$ can
be simulated.

b) Specifying a proper flat normal prior distribution
N(0,10°) on $\beta$ and an inverse gamma distribution
IG(0.1, 0.1) on $\tau^2$, the posterior distribution of
$(\beta, b, \tau^2)$ given the simulated latent continuous data
$y^*$ is

$$(\beta, b) \mid \tau^2, y^* \sim MVN\left((C^T C + D/\tau^2)^{-1} C^T y^*,\ (C^T C + D/\tau^2)^{-1}\right),$$

$$\tau^2 \mid \beta, b \sim IG(0.1 + m/2,\ 0.1 + \|b\|^2/2), \qquad (11)$$

where $C = [X, Z]$ and $D$ is a diagonal matrix with
$p + 1$ values of 10° followed by $m$ ones on the
diagonal. Gelman (2006) recommended a uniform
prior distribution on $\tau$, which results in the posterior
distribution for $\tau^2$ as

$$\tau^2 \mid \beta, b \sim IG((m - 1)/2,\ \|b\|^2/2). \qquad (12)$$


c) At iteration $t$, draws of $(\beta^{(t)}, b^{(t)}, \tau^{2(t)})$ from the
posterior distribution in equation (11) or (12) are used
to generate new latent data $y^*$ conditional on the
observed binary variable $y$ for the sample, and to
obtain the posterior predicted values $\hat{y}_j^{(t)}$ for
non-sample units. We can then obtain draws from the
posterior distribution of the finite population
proportion at iteration $t$ as

$$p^{(t)} = N^{-1}\left(\sum_{i \in s} y_i + \sum_{j \notin s} \hat{y}_j^{(t)}\right).$$
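Three of the conditional draws in this algorithm can be sketched with the Python standard library alone: the latent data step, the update of $\tau^2$ under the IG(0.1, 0.1) prior, and the draw of the finite population proportion. This is a partial illustration with hypothetical function names; the multivariate normal draw of $(\beta, b)$ is omitted, and the linear predictors (the elements of $X\beta + Zb$) are assumed to be supplied by the caller.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_latent(mean, y, rng):
    """Draw y* ~ N(mean, 1) truncated to (0, inf) if y == 1, else to (-inf, 0]."""
    p0 = STD_NORMAL.cdf(-mean)                 # P(y* <= 0) before truncation
    u = rng.random()
    q = p0 + u * (1.0 - p0) if y == 1 else u * p0
    # Keep the inverse-CDF argument strictly inside (0, 1); adequate for
    # moderate |mean|, but it can break the truncation for extreme means
    q = min(max(q, 1e-12), 1.0 - 1e-12)
    return mean + STD_NORMAL.inv_cdf(q)

def draw_tau2(b, rng, shape0=0.1, rate0=0.1):
    """Draw tau^2 | b ~ IG(shape0 + m/2, rate0 + ||b||^2 / 2)."""
    shape = shape0 + len(b) / 2
    rate = rate0 + sum(v * v for v in b) / 2
    # If X ~ Gamma(shape, scale = 1/rate), then 1/X ~ IG(shape, rate)
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)

def draw_proportion(y_sample, nonsample_means, rng):
    """One posterior draw of the finite population proportion."""
    # Predicted binary outcome for each non-sample unit: indicator of y* > 0
    predicted = sum(1 for mu in nonsample_means if rng.gauss(mu, 1.0) > 0)
    return (sum(y_sample) + predicted) / (len(y_sample) + len(nonsample_means))
```

The inverse-CDF draw is used here for brevity; a production sampler would typically use a dedicated truncated normal generator that remains accurate far in the tails.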
The effect of nonresponse adjustments on variance estimation


David Haziza, Katherine Jenny Thompson and Wesley Yung ' 


Abstract 


Many surveys employ weight adjustment procedures to reduce nonresponse bias. These adjustments make use of available 
auxiliary data. This paper addresses the issue of jackknife variance estimation for estimators that have been adjusted for 
nonresponse. Using the reverse approach for variance estimation proposed by Fay (1991) and Shao and Steel (1999), we 
study the effect of not re-calculating the nonresponse weight adjustment within each jackknife replicate. We show that the 
resulting ‘shortcut’ jackknife variance estimator tends to overestimate the true variance of point estimators in the case of 
several weight adjustment procedures used in practice. These theoretical results are confirmed through a simulation study 
where we compare the shortcut jackknife variance estimator with the full jackknife variance estimator obtained by re- 
calculating the nonresponse weight adjustment within each jackknife replicate. 


Key Words: Calibration; Nonresponse adjustment; Unit nonresponse; Jackknife variance estimator; Linearization variance estimator.


1. Introduction 


Unit nonresponse, which occurs when, for a sample unit, 
all the survey variables are missing or when not enough 
usable information is available, is unavoidable in surveys. 
To address this, the nonrespondents are deleted from the 
data file and the survey weights of the respondents are 
adjusted to compensate for the deletions. The primary 
objective of a weight adjustment procedure is to reduce the 
nonresponse bias, which is introduced when respondents 
and nonrespondents are different with respect to the survey 
variables. Key to achieving an efficient bias reduction is the 
use of powerful auxiliary information available for both 
respondents and nonrespondents. 

In this paper, we consider jackknife variance estimation 
in the presence of unit nonresponse. This variance 
estimation method is widely used in practice because of its 
theoretical properties and computational ease. In contrast to 
Taylor linearization procedures, the jackknife method does 
not require a separate derivation for each parameter of 
interest nor the second-order inclusion probabilities that 
may be difficult to obtain in complex surveys. When using a 
jackknife variance estimator in the context of nonresponse, 
there is some question of whether or not the nonresponse 
adjustment needs to be replicated (e.g., Valliant 2004). In 
this paper, we consider two jackknife variance estimators: 
(i) a full jackknife variance estimator, which recalculates the
nonresponse adjustment factor within each jackknife
replicate, and (ii) a shortcut jackknife variance estimator,
which does not. The shortcut jackknife variance estimator is
convenient in practice but its theoretical properties have not,
to our knowledge, been fully studied in the literature. Production
reasons tend to drive the usage of a shortcut jackknife
variance estimator, since the full jackknife variance
estimator in the context of stratified sampling can be quite 
time-consuming and computer resource-intensive, espe- 
cially when a survey utilizes a large number of weighting 
cells. Some recent studies conducted at the U.S. Census 
Bureau (Thompson 2005 and Ozcoskun, Thompson and 
Williams 2005) found negligible differences between 
variance estimates obtained using a fully replicated weight 
adjustment procedure and those obtained using a “shortcut” 
procedure with stratified jackknife, delete-a-group jack- 
knife, and modified half sample variance estimators. 

Two types of adjustment procedures are commonly used 
in practice. The first, called nonresponse propensity
weighting (NPW), consists of first modeling the response 
propensities and using the inverse of the estimated 
propensities as the weighting adjustment. The estimated 
response propensities are typically obtained by fitting a 
parametric model (e.g., logistic regression model) or by
fitting a nonparametric model; e.g., Da Silva and Opsomer 
(2006). A special case of NPW, which is very popular in 
practice, consists of first dividing the respondents and 
nonrespondents into weighting classes and adjusting the 
design weights of respondents by the inverse of the response 
rate within each class. These classes are formed on the basis 
of auxiliary information recorded for all units in the sample; 
see, for example, Eltinge and Yansaneh (1997) and Little 
(1986). The second type of adjustment procedure, called
nonresponse calibration weighting (NCW), can be seen as
an extension of the calibration approach (Deville and
Särndal 1992) adapted to the context of unit nonresponse.
The reader is referred to Särndal and Lundström (2005),
Kott (2006) and Brick and Montaquila (2008) for a
comprehensive overview of NPW and NCW. In some
situations, NPW and NCW lead to the same estimator; for
example, the count-adjusted estimator presented below (see 
expression (1.4)). In this paper, we focus on NCW. The 
problem of variance estimation in the context of NPW has 
been recently studied by Kim and Kim (2007). 

Consider a finite population $U$ of size $N$. The objective
is to estimate the population total $Y = \sum_{i \in U} y_i$ of a variable
of interest $y$. Suppose that a random sample $s$ of size $n$ is
selected from $U$ according to a given design $p(s)$. In the
case of complete data, a basic estimator of $Y$ is the well-known
expansion estimator given by

$$\hat{Y}_{\pi} = \sum_{i \in s} d_i y_i, \qquad (1.1)$$

where $d_i = 1/\pi_i$ denotes the design weight attached to unit
$i$ and $\pi_i = P(i \in s)$ denotes its first-order probability of
inclusion in the sample. In the presence of unit nonresponse,
only a subset of $s$ is observed, and so the computation of
$\hat{Y}_{\pi}$ in (1.1) is not possible.

To define a nonresponse adjusted estimator of $Y$, we
assume that a vector of auxiliary variables $\mathbf{x}$ is available for
all the sampled units (respondents and nonrespondents) so
that the vector of estimated totals,
$\hat{\mathbf{X}}_{\pi} = \sum_{i \in s} d_i \mathbf{x}_i$, is
available. We also assume that a vector of instrumental
variables $\mathbf{z}$, of the same dimension as $\mathbf{x}$, is available for the
respondents. Let $r_i$ be a response indicator attached to unit
$i$ such that $r_i = 1$ if unit $i$ is a responding unit and $r_i = 0$,
otherwise. To estimate $Y$, we consider calibration estimators
of the form

$$\hat{Y}_{CAL} = \sum_{i \in s} w_i r_i y_i, \qquad (1.2)$$

where $w_i = d_i g_i$ and $g_i$ is a nonresponse weighting
adjustment factor attached to unit $i$ and given by

$$g_i = 1 + (\hat{\mathbf{X}}_{\pi} - \hat{\mathbf{X}}_{r})' \hat{\mathbf{T}}_{r}^{-1} \mathbf{z}_i, \qquad (1.3)$$

where $\hat{\mathbf{X}}_{r} = \sum_{i \in s} d_i r_i \mathbf{x}_i$ and
$\hat{\mathbf{T}}_{r} = \sum_{i \in s} d_i r_i \mathbf{z}_i \mathbf{x}_i'$. When
$\mathbf{z}_i = \mathbf{x}_i / v_i$, where $v_i$ is a known constant, the resulting
estimator is identical to the InfoS estimator given in Särndal and
Lundström (2005, equation 7.15). The properties of the
estimator (1.2) were studied by Deville (2002), Sautory
(2003), Särndal and Lundström (2005) and Kott (2006),
among others.
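
As an illustration of (1.3), the following Python sketch computes the adjustment factors from design weights, response indicators, auxiliary vectors and instruments. The function name `calibration_factors` and the toy data are assumptions made for this example (they do not come from the paper), and NumPy is assumed to be available.

```python
import numpy as np

def calibration_factors(d, r, x, z):
    """Return g_i = 1 + (X_pi - X_r)' T_r^{-1} z_i for every sampled unit."""
    d, r = np.asarray(d, float), np.asarray(r, float)
    x, z = np.asarray(x, float), np.asarray(z, float)    # both of shape (n, p)
    X_pi = (d[:, None] * x).sum(axis=0)                  # full-sample totals
    X_r = ((d * r)[:, None] * x).sum(axis=0)             # respondent totals
    T_r = (z * (d * r)[:, None]).T @ x                   # sum of d_i r_i z_i x_i'
    lam = np.linalg.solve(T_r.T, X_pi - X_r)             # (T_r')^{-1} (X_pi - X_r)
    return 1.0 + z @ lam

# Hypothetical toy sample: two weighting classes, x_i = z_i = class indicators.
d = np.full(8, 10.0)
r = np.array([1, 1, 0, 1, 1, 0, 0, 1])
cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x = np.eye(2)[cls]
g = calibration_factors(d, r, x, x)

# Calibration property: the adjusted respondent totals reproduce X_pi.
calibrated_totals = ((d * g * r)[:, None] * x).sum(axis=0)
```

With class indicators as auxiliary variables, the factors returned for this toy sample coincide with the ratio of the weighted class count to the weighted respondent count in each class.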

In this paper, the properties (e.g., bias and variance) of
$\hat{Y}_{CAL}$ are studied using the nonresponse model (NM)
approach, under which inference is made with respect to the
joint distribution induced by the sampling design and the
nonresponse mechanism, $q(\mathbf{r} \mid \mathbf{I})$, where
$\mathbf{I} = (I_1, \ldots, I_N)'$
is the vector of sample selection indicators such that $I_i = 1$
if unit $i$ is selected in the sample and $I_i = 0$, otherwise, and
$\mathbf{r} = (r_1, \ldots, r_N)'$ is the vector of response indicators. Let
$p_i = P(r_i = 1 \mid \mathbf{I}, I_i = 1)$ be the response probability for
unit $i$. We assume that $p_i > 0$ for all $i$ and that the units
respond independently of one another; that is,
$p_{ij} = P(r_i = 1, r_j = 1 \mid \mathbf{I}, I_i = 1, I_j = 1, i \neq j) = p_i p_j$.

The estimator $\hat{Y}_{CAL}$ is asymptotically unbiased for the
true total $Y$ if (i) $p_i^{-1} = 1 + \boldsymbol{\lambda}' \mathbf{z}_i$ for all
$i \in U$, where $\boldsymbol{\lambda}$ is
a vector of unknown constants or (ii) $y_i = \mathbf{x}_i' \mathbf{B}$ for all
$i \in U$, where $\mathbf{B}$ is a vector of constants; see Särndal and
Lundström (2005, chapter 9.5). If condition (i) is
satisfied, the point estimator $\hat{Y}_{CAL}$ is asymptotically
unbiased for $Y$ regardless of the variable of interest $y$
being estimated. Also, it follows from (ii) that $\hat{Y}_{CAL}$ has a
small bias if the residuals $E_i = y_i - \mathbf{x}_i' \mathbf{B}$ are small, where
$\mathbf{B} = (\sum_{i \in U} \mathbf{z}_i \mathbf{x}_i')^{-1} \sum_{i \in U} \mathbf{z}_i y_i$.
Therefore, the bias of the
estimator $\hat{Y}_{CAL}$ is small if the vector $\mathbf{x}$ explains the variable
of interest $y$. In the case of several variables of interest,
note that the vector $\mathbf{x}$ may explain a given variable of
interest well but may not be related to all, in which case
some estimates could be potentially biased. We assume that
$\hat{Y}_{CAL}$ is asymptotically unbiased for $Y$, so that the bias of
the estimators under consideration is not an issue in the
remainder of the paper.

We consider three special cases of (1.2) that are of
interest in practice (see also Kalton and Flores-Cervantes
2003). First, let $\mathbf{x}_i = \mathbf{z}_i = (\delta_{i1}, \ldots, \delta_{iC})'$
be a $C$-vector of
weighting class indicators attached to unit $i$ such that
$\delta_{ic} = 1$ if unit $i$ belongs to class $c$ and $\delta_{ic} = 0$, otherwise,
for $c = 1, \ldots, C$. The adjustment factor $g_i$
given by (1.3) reduces to $g_i = \hat{N}_c / \hat{N}_{rc}$ if unit $i$ belongs to
class $c$, where $\hat{N}_c = \sum_{i \in s} d_i \delta_{ic}$ and
$\hat{N}_{rc} = \sum_{i \in s} d_i r_i \delta_{ic}$. That is, the nonresponse
weighting adjustment factor for a weighting cell is
calculated as the sample-weighted number of sampled units
in the weighting cell divided by the sample-weighted
number of responding units in the weighting cell. We refer
to this weight adjustment procedure as the count adjustment
procedure. It follows that the estimator (1.2) reduces to the
count adjusted estimator

$$\hat{Y}_{CAL} = \sum_{c=1}^{C} \frac{\hat{N}_c}{\hat{N}_{rc}} \hat{Y}_{rc}, \qquad (1.4)$$

where $\hat{Y}_{rc} = \sum_{i \in s} d_i r_i \delta_{ic} y_i$.

The second special case of (1.2) assumes that a
continuous variable $x$ is available for all the sampled
units. Let $\mathbf{x}_i = (\delta_{i1} x_i, \ldots, \delta_{iC} x_i)'$ and
$\mathbf{z}_i = (\delta_{i1}, \ldots, \delta_{iC})'$. In
this case, the adjustment factor $g_i$ given by (1.3) reduces
to $g_i = \hat{X}_c / \hat{X}_{rc}$ if unit $i$ belongs to class $c$, where
$\hat{X}_c = \sum_{i \in s} d_i \delta_{ic} x_i$ and
$\hat{X}_{rc} = \sum_{i \in s} d_i r_i \delta_{ic} x_i$. Here, the
nonresponse weighting adjustment factor for a weighting
class $c$ is the sum of the sample-weighted auxiliary data
for units in the weighting cell divided by the sum of the
sample-weighted auxiliary data for all responding units in
the weighting cell. We refer to this weight adjustment
procedure as the ratio adjustment procedure. The estimator
(1.2) reduces to the ratio adjusted estimator

$$\hat{Y}_{CAL} = \sum_{c=1}^{C} \frac{\hat{X}_c}{\hat{X}_{rc}} \hat{Y}_{rc}, \qquad (1.5)$$

where $\hat{Y}_{rc} = \sum_{i \in s} d_i r_i \delta_{ic} y_i$.
Note that the count adjusted estimator (1.4) is a special
case of the ratio adjusted estimator when $x_i = 1$ for all the
sampled units.
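
The two class-based estimators can be sketched in a few lines of Python. This is a minimal illustration under the definitions above; the function names and the toy data are assumptions for the example only. The count adjusted estimator is obtained by passing a vector of ones as the auxiliary variable.

```python
import numpy as np

def ratio_adjusted_total(d, r, x, y, cls):
    """Sum over weighting classes of (X_c / X_rc) * Y_rc, as in (1.5)."""
    d, r, x, y = (np.asarray(a, float) for a in (d, r, x, y))
    cls = np.asarray(cls)
    total = 0.0
    for c in np.unique(cls):
        m = cls == c
        X_c = np.sum(d[m] * x[m])            # weighted x total, all sampled units
        X_rc = np.sum(d[m] * r[m] * x[m])    # weighted x total, respondents only
        Y_rc = np.sum(d[m] * r[m] * y[m])    # weighted y total, respondents only
        total += X_c / X_rc * Y_rc
    return total

def count_adjusted_total(d, r, y, cls):
    """Special case x_i = 1, which gives the count adjusted estimator (1.4)."""
    return ratio_adjusted_total(d, r, np.ones(len(y)), y, cls)

# Hypothetical toy sample in which y is exactly proportional to x.
d = np.full(8, 10.0)
r = np.array([1, 1, 0, 1, 1, 0, 0, 1])
cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x = np.array([2.0, 4, 6, 8, 10, 12, 14, 16])
y = 3.0 * x
ratio_total = ratio_adjusted_total(d, r, x, y, cls)
count_total = count_adjusted_total(d, r, y, cls)
```

Because the toy data follow the ratio model exactly, the ratio adjusted estimate equals the full-sample weighted total of y, whereas the count adjusted estimate does not.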
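
Finally, if we let
$\mathbf{x}_i = \mathbf{z}_i = (\delta_{i1}, \ldots, \delta_{iC}, \delta_{i1} x_i, \ldots, \delta_{iC} x_i)'$,
we obtain another special case of (1.2). In
this case, the adjustment factor $g_i$ given by (1.3) reduces to

$$g_i = \hat{N}_c \left\{ \frac{1}{\hat{N}_{rc}} + \frac{(\bar{x}_c - \bar{x}_{rc})(x_i - \bar{x}_{rc})}{\sum_{j \in s} d_j r_j \delta_{jc} (x_j - \bar{x}_{rc})^2} \right\}$$

if unit $i$ belongs to class $c$, where $\bar{x}_c = \hat{X}_c / \hat{N}_c$ and
$\bar{x}_{rc} = \hat{X}_{rc} / \hat{N}_{rc}$. We refer to this weight adjustment
procedure as the simple linear regression adjustment procedure.
The estimator (1.2) reduces to the simple linear regression
adjusted estimator

$$\hat{Y}_{CAL} = \sum_{c=1}^{C} \hat{N}_c \left\{ \bar{y}_{rc} + \hat{B}_{rc} (\bar{x}_c - \bar{x}_{rc}) \right\}, \qquad (1.6)$$

where $\bar{y}_{rc} = \hat{Y}_{rc} / \hat{N}_{rc}$ and

$$\hat{B}_{rc} = \frac{\sum_{i \in s} d_i r_i \delta_{ic} (x_i - \bar{x}_{rc})(y_i - \bar{y}_{rc})}{\sum_{i \in s} d_i r_i \delta_{ic} (x_i - \bar{x}_{rc})^2}.$$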

The estimators (1.4)-(1.6) use some form of weighting 
adjustment within classes. All of them are asymptotically 
unbiased for Y if the units have equal response probabilities 
within classes (i.e., a uniform nonresponse mechanism
within classes). This condition is a special case of condition 
(i) discussed above. 

In this paper, we show that the shortcut jackknife
variance estimator, which treats the adjustment factors as fixed,
tends to overestimate the true variance of $\hat{Y}_{CAL}$, at least in
some simple cases. We build on earlier research by
Thompson and Yung (2006) who derived expressions of the 
linearization version for both the full and shortcut jackknife 
variance estimators and evaluated these expressions empir- 
ically using data from the Annual Capital Expenditures 
Survey (ACES), conducted at the U.S. Census Bureau. In 
the context of NPW, it is interesting to note that Kim and 
Kim (2007) showed that treating the estimated response 
probabilities as fixed leads to an overestimation of the true 
variance when the sampling weights are not used in 
estimating these probabilities. Beaumont (2005) obtained 
similar results in the context of imputation when the 
response probabilities are estimated using a logistic
regression model. 




In Section 2, we discuss the full and shortcut jackknife 
variance estimators and show that the shortcut estimator is 
asymptotically biased. The severity of this bias is evaluated 
for two commonly used sample designs in Section 3. 
Section 4 presents the results of a simulation study com- 
paring the full and shortcut jackknife variance estimators. 
We conclude in Section 5 with some general observations. 


2. Jackknife variance estimation 


Traditionally, variance estimation in the context of 
nonresponse has been performed using the two-phase 
framework, which consists of viewing nonresponse as a 
second-phase of selection. Instead, we consider the reverse 
framework that was proposed by Fay (1991) and further 
developed by Shao and Steel (1999). This framework 
provides a theoretical basis for studying the properties of 
jackknife variance estimators and can be described as 
follows: first, applying the nonresponse mechanism, the
population $U$ is randomly divided into a population of
respondents $U_r$ and a population of nonrespondents $U_m$.
Then, given $(U_r, U_m)$, the random sample $s$ is selected
according to the chosen sampling design. The total variance
of $\hat{Y}_{CAL}$ can be expressed as

$$V(\hat{Y}_{CAL}) = E_q V_p(\hat{Y}_{CAL} \mid \mathbf{r}) + V_q E_p(\hat{Y}_{CAL} \mid \mathbf{r}), \qquad (2.1)$$

where $E_p(.)$ and $V_p(.)$ denote the expectation and the
variance with respect to the sampling design and $E_q(.)$ and
$V_q(.)$ denote the expectation and variance with respect to
the nonresponse mechanism, $q(\mathbf{r} \mid \mathbf{I})$.

In this section, we focus on stratified simple random
sampling, which is the design typically used in business
surveys. With this sample design, the population $U$ is
partitioned into $L$ strata $U_1, \ldots, U_L$ of size $N_1, \ldots, N_L$,
respectively. A simple random sample without replacement
$s_h$ of size $n_h$ is selected from stratum $h$, $h = 1, \ldots, L$.
Each within-stratum sample is selected independently, and
we assume that $n_h \geq 2$ for all $h$. In this context, the design
weight of unit $i$ in stratum $h$ is $d_{hi} = N_h / n_h$. A full
jackknife variance estimator of $\hat{Y}_{CAL}$, under stratified simple
random sampling, is obtained as follows:


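(i) remove unit $(gj)$ from the sample, $g = 1, \ldots, L$;
$j = 1, \ldots, n_g$;

(ii) adjust the design weights $d_{hi}$ to obtain the jackknife
weights $d_{hi(gj)}$, where $d_{hi(gj)}$ is given by

$$d_{hi(gj)} = \begin{cases} 0 & \text{if } (hi) = (gj) \\ \dfrac{n_g}{n_g - 1} d_{hi} & \text{if } h = g \text{ and } i \neq j \\ d_{hi} & \text{otherwise;} \end{cases}$$

(iii) compute the estimator $\hat{Y}_{CAL(gj)}$ in the same way as
$\hat{Y}_{CAL}$ with the jackknife weights $d_{hi(gj)}$ instead of
the design weights $d_{hi}$; that is,
$\hat{Y}_{CAL(gj)} = \sum_{h=1}^{L} \sum_{i \in s_h} w_{hi(gj)} r_{hi} y_{hi}$,
where $w_{hi(gj)} = d_{hi(gj)} g_{hi(gj)}$ with
$g_{hi(gj)} = 1 + (\hat{\mathbf{X}}_{\pi(gj)} - \hat{\mathbf{X}}_{r(gj)})' \hat{\mathbf{T}}_{r(gj)}^{-1} \mathbf{z}_{hi}$,
$\hat{\mathbf{X}}_{\pi(gj)} = \sum_{h=1}^{L} \sum_{i \in s_h} d_{hi(gj)} \mathbf{x}_{hi}$,
$\hat{\mathbf{X}}_{r(gj)} = \sum_{h=1}^{L} \sum_{i \in s_h} d_{hi(gj)} r_{hi} \mathbf{x}_{hi}$ and
$\hat{\mathbf{T}}_{r(gj)} = \sum_{h=1}^{L} \sum_{i \in s_h} d_{hi(gj)} r_{hi} \mathbf{z}_{hi} \mathbf{x}_{hi}'$;

(iv) replace the unit deleted in step (i) back into the
sample;

(v) repeat steps (i)-(iv) for all $(gj)$ units, $g = 1, \ldots, L$;
$j = 1, \ldots, n_g$.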


Note that the nonresponse adjustment factors $g_{hi}$ are
recalculated in each replicate. This leads to the full jackknife
variance estimator

$$v_{JF} = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j \in s_g} \left( \hat{Y}_{CAL(gj)} - \hat{Y}_{CAL} \right)^2. \qquad (2.2)$$


The variance estimator $v_{JF}$ is an estimator of the first
term on the right hand side of (2.1),
$E_q V_p(\hat{Y}_{CAL} \mid \mathbf{r})$. This
term represents the design variance that we would have
obtained if the responding units were selected using
stratified simple random sampling with replacement, or
equivalently, if the stratum sampling fractions, $n_h / N_h$, are
negligible. In other words, the full jackknife variance
estimator (2.2) is an estimator of the sampling variance
conditional on the vector of response indicators $\mathbf{r}$.
Therefore, $v_{JF}$ is asymptotically unbiased and consistent for
$E_q V_p(\hat{Y}_{CAL} \mid \mathbf{r})$ under stratified simple random sampling
with negligible sampling fractions regardless of the validity of the
underlying assumptions. Note that since $v_{JF}$ is an estimator
of a sampling variance, it can be readily obtained using
software designed for complete-data jackknife variance
estimation. In other words, no specialized software is
needed. Also, note that the second term on the right hand
side of (2.1), $V_q E_p(\hat{Y}_{CAL} \mid \mathbf{r})$, is not accounted for by
the full jackknife variance estimator.
However, the contribution of this term to the
total variance is negligible if the stratum sampling fractions,
$n_h / N_h$, are negligible. As a result, $v_{JF}$ is asymptotically
unbiased and consistent for the total variance, $V(\hat{Y}_{CAL})$.
That is, $E_{pq}(v_{JF}) \approx V(\hat{Y}_{CAL})$. Since the goal of the research
is to compare the full and shortcut jackknife estimators, in 
the remainder of the paper, we assume that the stratum 
sampling fractions are negligible and focus on estimates of 
totals, so that we can omit the estimation of the second term 
in (2.1). We note that even if the second term is not 
negligible, our comparisons are valid as both the full 
jackknife and shortcut estimators would underestimate the 
total variance by the same term. 
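
The replication steps (i)-(v) can be sketched as follows for the count adjusted estimator. This is a minimal Python illustration, not the authors' production code: the function names and the toy sample are hypothetical, NumPy is assumed, and every weighting class is assumed to keep at least one respondent in every replicate.

```python
import numpy as np

def count_adjusted_total(w, r, y, cls):
    """Count adjusted estimator; the class factors are recomputed from w."""
    total = 0.0
    for c in np.unique(cls):
        m = cls == c
        total += w[m].sum() / (w[m] * r[m]).sum() * (w[m] * r[m] * y[m]).sum()
    return total

def full_jackknife(d, r, y, cls, strata):
    """Delete-one stratified jackknife, re-adjusting in every replicate."""
    theta = count_adjusted_total(d, r, y, cls)
    v = 0.0
    for j in range(len(d)):
        same = strata == strata[j]
        n_g = same.sum()
        w = d.copy()
        w[same] *= n_g / (n_g - 1)   # rescale the remaining units of the stratum
        w[j] = 0.0                   # drop the deleted unit
        theta_j = count_adjusted_total(w, r, y, cls)
        v += (n_g - 1) / n_g * (theta_j - theta) ** 2
    return v

# Hypothetical toy sample: one stratum, one weighting class, N = 60, n = 6.
N, n = 60, 6
d = np.full(n, N / n)
r = np.array([1.0, 1, 1, 0, 1, 0])
y = np.array([3.0, 5, 4, 0, 8, 0])   # y is irrelevant where r = 0
cls = np.zeros(n, dtype=int)
strata = np.zeros(n, dtype=int)
v_jf = full_jackknife(d, r, y, cls, strata)
```

Deleting a nonrespondent leaves the count adjusted estimate unchanged in this single-class example, so only the respondent replicates contribute to the sum.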

A shortcut jackknife variance estimator of $\hat{Y}_{CAL}$ is given
by

$$v_{JS} = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j \in s_g} \left( \hat{Y}^{*}_{CAL(gj)} - \hat{Y}_{CAL} \right)^2, \qquad (2.3)$$

where $\hat{Y}^{*}_{CAL(gj)} = \sum_{h=1}^{L} \sum_{i \in s_h} d_{hi(gj)} g_{hi} r_{hi} y_{hi}$.
Note that the
nonresponse weighting adjustment factors $g_{hi}$ are not
recalculated in each jackknife replicate. In other words, the
factors $g_{hi}$ are treated as constants, which is inappropriate
since they depend on the sample and the set of respondents.
Therefore, we have $E_{pq}(v_{JS}) \neq V(\hat{Y}_{CAL})$, in general, and
the shortcut variance estimator, $v_{JS}$, is biased.
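
For comparison, a shortcut replicate only changes the design weights and reuses the adjustment factors computed once on the full sample. The Python sketch below makes this explicit; the function name and the toy sample are hypothetical and NumPy is assumed.

```python
import numpy as np

def shortcut_jackknife(d, g, r, y, strata):
    """Delete-one stratified jackknife with the factors g held fixed."""
    u = g * r * y                        # adjusted values, never recomputed
    theta = np.sum(d * u)
    v = 0.0
    for j in range(len(d)):
        same = strata == strata[j]
        n_g = same.sum()
        w = d.copy()
        w[same] *= n_g / (n_g - 1)       # jackknife design weights only
        w[j] = 0.0
        v += (n_g - 1) / n_g * (np.sum(w * u) - theta) ** 2
    return v

# Hypothetical toy sample: one stratum, one weighting class, N = 60, n = 6.
N, n = 60, 6
d = np.full(n, N / n)
r = np.array([1.0, 1, 1, 0, 1, 0])
y = np.array([3.0, 5, 4, 0, 8, 0])
strata = np.zeros(n, dtype=int)
g = np.full(n, d.sum() / (d * r).sum())  # count adjustment factor, single class
v_js = shortcut_jackknife(d, g, r, y, strata)
```

In the shortcut version the replicates that delete a nonrespondent also move the estimate, because the zero contributions of nonrespondents are treated as part of a fixed set of adjusted values.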

To study the magnitude of the bias of $v_{JS}$, we consider
the difference of the two jackknife variance estimators,
$D = v_{JS} - v_{JF}$. Since the variance estimator $v_{JF}$ is an
asymptotically unbiased estimator of the term
$V_p(\hat{Y}_{CAL} \mid \mathbf{r})$,
it is asymptotically equivalent to a variance estimator
obtained using a first-order Taylor expansion. The resulting
variance estimator, denoted by $v_{LF}$, is the linearization
jackknife variance estimator studied by Yung and Rao
(2000). Similarly, the shortcut jackknife variance estimator
$v_{JS}$ is asymptotically equivalent to a variance estimator of
$V_p(\hat{Y}_{CAL} \mid \mathbf{r})$ obtained by treating the nonresponse weighting
adjustment factors $g_{hi}$ as constants. We denote this variance
estimator by $v_{LS}$. The quantity $D$ can thus be approximated
by $\tilde{D} = v_{LS} - v_{LF}$. For this approximation to be valid, we
assume the number of respondents to be large.

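Noting that $\text{Bias}(v_{JF}) = E_{pq}(v_{JF}) - V(\hat{Y}_{CAL}) \approx 0$, it
follows that the bias of $v_{JS}$,
$\text{Bias}(v_{JS}) = E_{pq}(v_{JS}) - V(\hat{Y}_{CAL})$,
can be approximated by $E_{pq}(D) \approx E_{pq}(\tilde{D})$. Let $v(y)$
denote the variance estimator of the complete data estimator
(1.1). Using a first-order Taylor expansion, it can be shown
that an estimator of $V_p(\hat{Y}_{CAL} \mid \mathbf{r})$ is given by

$$v_{LF} = v(\xi), \qquad (2.4)$$

where

$$\xi_{hi} = \mathbf{x}_{hi}' \hat{\mathbf{B}}_r + g_{hi} r_{hi} e_{hi},$$

with $e_{hi} = y_{hi} - \mathbf{x}_{hi}' \hat{\mathbf{B}}_r$ and
$\hat{\mathbf{B}}_r = \hat{\mathbf{T}}_r^{-1} \sum_{h=1}^{L} \sum_{i \in s_h} d_{hi} r_{hi} \mathbf{z}_{hi} y_{hi}$.
On the other hand, treating the $g_{hi}$'s as constants implies
that $\hat{Y}_{CAL}$ is linear in the design weights $d_{hi}$. It follows that
$v_{LS}$ is given by

$$v_{LS} = v(\psi), \qquad (2.5)$$

where $\psi_{hi} = g_{hi} r_{hi} y_{hi}$.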
For example, for either a fixed size or a random size
sampling design, a possible variance estimator is

$$v_{LF} = \sum_{i \in s} \sum_{j \in s} \Delta_{ij} \xi_i \xi_j,$$

where $\Delta_{ij} = (\pi_{ij} - \pi_i \pi_j) / (\pi_{ij} \pi_i \pi_j)$ and $\pi_{ij}$ is the
second-order inclusion probability of units $i$ and $j$. Note that
$\pi_{ii} = \pi_i$. Similarly, we have

$$v_{LS} = \sum_{i \in s} \sum_{j \in s} \Delta_{ij} \psi_i \psi_j.$$

3. Bias of $v_{JS}$ in some special cases


3.1 Simple random sampling without replacement 


In this section, we assume that the sample s has been 
selected according to simple random sampling without 
replacement. We also assume that the sampling fraction 
n/N is negligible and that the number of respondents r is 
large. Finally, we assume a single weighting class. Although 
the above situation is not realistic in practice, it provides 
some insight into the asymptotic bias of $v_{JS}$.

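In the case of the ratio adjusted estimator (1.5), we can
show that $D$ is approximately given by

$$D \approx \frac{N^2}{n} \left[ \frac{n}{r} \left( \frac{\bar{x}}{\bar{x}_r} \right)^2 \left\{ \hat{R}_r^2 s_{xr}^2 + 2 \hat{R}_r s_{exr} + \left( 1 - \frac{r}{n} \right) \bar{y}_r^2 \right\} - \hat{R}_r^2 s_x^2 - 2 \hat{R}_r \frac{\bar{x}}{\bar{x}_r} s_{exr} \right], \qquad (3.1)$$

where $(\bar{x}_r, \bar{y}_r) = r^{-1} \sum_{i \in s} r_i (x_i, y_i)$ denote the means of the
respondents for variables $x$ and $y$, respectively, and $r$ is the
number of respondents, $\hat{R}_r = \bar{y}_r / \bar{x}_r$,
$s_{xr}^2 = (r - 1)^{-1} \sum_{i \in s} r_i (x_i - \bar{x}_r)^2$,
$s_x^2 = (n - 1)^{-1} \sum_{i \in s} (x_i - \bar{x})^2$ with
$\bar{x} = n^{-1} \sum_{i \in s} x_i$, and
$s_{exr} = (r - 1)^{-1} \sum_{i \in s} r_i (y_i - \hat{R}_r x_i) x_i$.
If we further assume that all units have equal
response probabilities (i.e., a uniform response mechanism),
we have $\bar{x} / \bar{x}_r \xrightarrow{p} 1$ and $s_{xr}^2 / s_x^2 \xrightarrow{p} 1$. In this case, the
asymptotic bias of $v_{JS}$ is given by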


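$$\text{Bias}(v_{JS}) \approx E_{pq}(D) \approx \frac{N^2}{E_{pq}(r)} \left\{ 1 - E_{pq}\left( \frac{r}{n} \right) \right\} S_y^2 \left\{ \frac{1}{CV(y)^2} + 2 \rho_{xy} \frac{CV(x)}{CV(y)} - \frac{CV(x)^2}{CV(y)^2} \right\}, \qquad (3.2)$$

where $CV(x) = S_x / \bar{X}$ and $CV(y) = S_y / \bar{Y}$ denote the
population coefficients of variation for variables $x$ and $y$,
respectively, with $S_y^2 = (N - 1)^{-1} \sum_{i \in U} (y_i - \bar{Y})^2$ and
$\bar{Y} = N^{-1} \sum_{i \in U} y_i$; $S_x^2$ and $\bar{X}$ are defined similarly, and $\rho_{xy}$
denotes the finite population coefficient of correlation for
variables $x$ and $y$. From (3.2), it follows that the
asymptotic bias of $v_{JS}$ is nonnegative if and only if

$$B_0 \leq \bar{Y} \, \frac{1 + CV(x)^2}{2 \, CV(x)^2}, \qquad (3.3)$$

provided $0 < E_{pq}(r/n) < 1$, where $B_0 = \bar{Y} - B_1 \bar{X}$ is the
finite population intercept of the least squares line when
regressing $y$ on $x$ with

$$B_1 = \frac{\sum_{i \in U} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i \in U} (x_i - \bar{X})^2}.$$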
From (3.2), it is clear that the bias of $v_{JS}$ increases if (i)
the expected response rate $E_{pq}(r/n)$ decreases; (ii) $\rho_{xy}$
increases; (iii) $CV(y)$ decreases; or (iv) $CV(x)$ increases.
Also, it follows from (3.3) that $v_{JS}$ overestimates the true
variance when the intercept $B_0$ is not too large. Table 1
illustrates the relationship between $CV(x)$ and the
condition in (3.3). For example, when $CV(x) = 0$, $v_{JS}$
always overestimates the true variance since, in this case,
the condition (3.3) reduces to $B_0 < \infty$, which is always
satisfied. This result is not surprising because when
$CV(x) = 0$, the x-values are all equal and the ratio
adjusted estimator (1.5) is identical to the count adjusted
estimator (1.4). As we discuss below, $v_{JS}$ always
overestimates the true variance in this case. When $CV(x)$ is
large (e.g., $CV(x) = 2$), $v_{JS}$ overestimates the true variance
if and only if $B_0 < 0.625 \bar{Y}$. The latter condition is satisfied
if the intercept is not "too far" from the origin. Therefore, if
the relationship between $y$ and $x$ goes through the origin
(i.e., if the ratio model holds), the shortcut variance
estimator will overestimate the true variance. However, if
the ratio adjusted estimator is used when the ratio model
does not hold, such as when $B_0 > 0.625 \bar{Y}$, the shortcut
variance estimator $v_{JS}$ will underestimate the true variance.
In conclusion, we can expect $v_{JS}$ to overestimate the true
variance when a ratio adjustment procedure is used unless
the ratio model is highly misspecified for the data at hand,
which could happen, for example, if the variables $y$ and $x$
are negatively correlated.


Table 1
Relationship between CV(x) and the condition in (3.3)

CV(x)     $\bar{Y} (1 + CV(x)^2) / (2 \, CV(x)^2)$
0         $\infty$
0.1       50.5 $\bar{Y}$
0.5       2.5 $\bar{Y}$
1         $\bar{Y}$
1.5       0.722 $\bar{Y}$
2         0.625 $\bar{Y}$
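
The entries of Table 1 follow directly from the right hand side of condition (3.3), expressed as a multiple of the population mean of y. A short Python sketch (the function name is an assumption made for this example) reproduces them:

```python
def intercept_threshold(cv_x):
    """Upper bound on B_0 in (3.3), in units of Ybar; infinite when CV(x) = 0."""
    if cv_x == 0:
        return float("inf")
    return (1.0 + cv_x ** 2) / (2.0 * cv_x ** 2)

# Values of CV(x) used in Table 1.
thresholds = {cv: intercept_threshold(cv) for cv in (0, 0.1, 0.5, 1, 1.5, 2)}
```

The threshold decreases towards one half as CV(x) grows, so the condition on the intercept becomes more restrictive for highly variable auxiliary data.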


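Turning to the count adjusted estimator (1.4), we let
$x_i = 1$ for all $i$ in (3.1) and obtain

$$D \approx \frac{N^2}{r} \left( 1 - \frac{r}{n} \right) \bar{y}_r^2. \qquad (3.4)$$

It follows from (3.4) that the relative bias of $v_{JS}$,
$RB(v_{JS}) = \text{Bias}(v_{JS}) / V(\hat{Y}_{CAL})$, can be approximated by
$E_{pq}(RD)$, where $RD = D / v_{LF}$ and $v_{LF}$ is the linearization version of
the full jackknife variance estimator. Under a uniform nonresponse
mechanism, straightforward algebra leads to

$$RB(v_{JS}) \approx E_{pq}(RD) \approx \left\{ 1 - E_{pq}\left( \frac{r}{n} \right) \right\} \frac{1}{CV(y)^2}. \qquad (3.5)$$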

The expression (3.5) shows that, in the case of the count
adjusted estimator (1.4), $v_{JS}$ always overestimates the true
variance. The magnitude of the overestimation increases as
the expected response rate $E_{pq}(r/n)$ decreases or when
$CV(y)$ decreases. For example, if the expected response
rate is equal to 70% and $CV(y) = 1$, we have $E_{pq}(RD) = 0.3$
so the shortcut jackknife variance estimator, $v_{JS}$, is on
average 30% larger than the true variance of $\hat{Y}_{CAL}$. On the
other hand, if the response rate is equal to 70% and
$CV(y) = 0.5$, we have $E_{pq}(RD) = 1.2$, in which case the
overestimation is considerable.
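
The size of the overestimation for the count adjusted estimator can be checked numerically. The sketch below is an illustration under the assumptions of this section (simple random sampling, a single weighting class, uniform response with probability 0.7) and uses exponentially distributed y-values so that CV(y) is close to 1; the seed, the sizes N and n and the distribution are arbitrary choices for the example. In a single class with equal weights, each jackknife replicate can be written in closed form, which avoids an explicit loop.

```python
import numpy as np

rng = np.random.default_rng(2010)
N, n, p = 1_000_000, 50_000, 0.7
y = rng.exponential(scale=1.0, size=n)    # exponential values: CV(y) = 1
resp = rng.random(n) < p                  # uniform response mechanism
r = resp.sum()
ry = np.where(resp, y, 0.0)
tot = ry.sum()
theta = N * tot / r                       # count adjusted estimate, N * ybar_r

# Full jackknife: each replicate is N times the mean of the remaining respondents.
full_reps = N * (tot - ry) / (r - resp)
v_jf = (n - 1) / n * np.sum((full_reps - theta) ** 2)

# Shortcut: g = n / r stays fixed, only the design weight becomes N / (n - 1).
short_reps = N / (n - 1) * (n / r) * (tot - ry)
v_js = (n - 1) / n * np.sum((short_reps - theta) ** 2)

rd = v_js / v_jf - 1.0                    # relative difference of the two estimators
```

If the approximation discussed above holds, `rd` should be close to (1 - 0.7) / 1 = 0.3 for this configuration, up to simulation noise.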

Finally, we turn to the case of the simple linear regression
adjusted estimator (1.6). Under a uniform nonresponse
mechanism, it can be shown that the asymptotic bias of $v_{JS}$
is given by

$$\text{Bias}(v_{JS}) \approx E_{pq}(D) \approx \frac{N^2}{E_{pq}(r)} \left\{ 1 - E_{pq}\left( \frac{r}{n} \right) \right\} S_y^2 \left\{ \frac{1}{CV(y)^2} + \rho_{xy}^2 \right\}. \qquad (3.6)$$

From (3.6), it follows that $v_{JS}$ always overestimates the
true variance in the case of the simple linear regression
adjusted estimator (1.6). The bias (3.6) increases if (i) the
expected response rate decreases; (ii) $\rho_{xy}$ increases; or (iii)
$CV(y)$ decreases.


3.2 Stratified simple random sampling: Weighting 
classes are identical to strata 


In this section, we assume that the weighting classes 
coincide with the original design strata. This situation is not 
uncommon in practice, especially in business surveys. If the 
strata are such that the units within stratum have approxi- 
mately equal response propensities (i.e., uniform response 
within stratum), expressions for the bias of $v_{JS}$ are readily
obtained from expressions (3.2), (3.4) and (3.6). 

For the ratio adjusted estimator, expression (3.2) can be
readily extended to the case of stratified simple random
sampling to obtain

$$\text{Bias}(v_{JS}) \approx E_{pq}(D) \approx \sum_{h=1}^{L} \frac{N_h^2}{E_{pq}(r_h)} \left\{ 1 - E_{pq}\left( \frac{r_h}{n_h} \right) \right\} S_{yh}^2 \left\{ \frac{1}{CV_h(y)^2} + 2 \rho_{xyh} \frac{CV_h(x)}{CV_h(y)} - \frac{CV_h(x)^2}{CV_h(y)^2} \right\}, \qquad (3.7)$$

where the quantities $r_h$, $CV_h(x)$, $CV_h(y)$, $S_{yh}^2$ and $\rho_{xyh}$
correspond to $r$, $CV(x)$, $CV(y)$, $S_y^2$ and $\rho_{xy}$ computed in each
stratum.

For the count adjusted estimator, expression (3.4) can be
readily extended to the case of stratified simple random
sampling to obtain

$$\text{Bias}(v_{JS}) \approx E_{pq}(D) \approx \sum_{h=1}^{L} \frac{N_h^2}{E_{pq}(r_h)} \left\{ 1 - E_{pq}\left( \frac{r_h}{n_h} \right) \right\} \frac{S_{yh}^2}{CV_h(y)^2}. \qquad (3.8)$$

Finally, for the simple linear regression adjusted estimator,
expression (3.6) can be readily extended to the case
of stratified simple random sampling to obtain

$$\text{Bias}(v_{JS}) \approx E_{pq}(D) \approx \sum_{h=1}^{L} \frac{N_h^2}{E_{pq}(r_h)} \left\{ 1 - E_{pq}\left( \frac{r_h}{n_h} \right) \right\} S_{yh}^2 \left\{ \frac{1}{CV_h(y)^2} + \rho_{xyh}^2 \right\}. \qquad (3.9)$$

From the expressions (3.7)-(3.9), it follows that the use of
the shortcut jackknife variance estimator requires some
caution. Indeed, even if the bias of the shortcut jackknife
variance estimator is small in each stratum, the stratum biases
might sum up to a considerable bias at the population level if
they are in the same direction.
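
To see how stratum-level biases accumulate, the count adjusted case can be evaluated numerically. The Python sketch below assumes that the stratified bias takes the form of a sum over strata of $N_h^2 / E(r_h)$ times $\{1 - E(r_h / n_h)\}$ times the squared stratum mean (using $S_{yh}^2 / CV_h(y)^2 = \bar{Y}_h^2$), with $E(r_h) = n_h p_h$ under uniform response within strata. The function name and the stratum means are hypothetical values chosen for illustration.

```python
def shortcut_bias_count(N_h, n_h, p_h, ybar_h):
    """Approximate bias of the shortcut estimator, summed over strata."""
    return sum(
        N ** 2 / (n * p) * (1.0 - p) * yb ** 2
        for N, n, p, yb in zip(N_h, n_h, p_h, ybar_h)
    )

# Hypothetical design: three strata of 10,000 units, 100 sampled per stratum.
N_h = [10_000, 10_000, 10_000]
n_h = [100, 100, 100]
p_h = [0.6, 0.7, 0.9]            # response probabilities within strata
ybar_h = [40.0, 80.0, 120.0]     # illustrative stratum means

per_stratum = [
    shortcut_bias_count([N], [n], [p], [yb])
    for N, n, p, yb in zip(N_h, n_h, p_h, ybar_h)
]
total_bias = shortcut_bias_count(N_h, n_h, p_h, ybar_h)
```

Every term is nonnegative for the count adjusted estimator, so the stratum contributions can only add up.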


4. Simulation study 


A simulation study was performed to compare the 
statistical properties of the shortcut and the full jackknife 
variance estimators under varying conditions. Five different 
stratified populations of 30,000 units each with two 
variables were generated. First, the x-values were generated
from a Gamma distribution with parameters $\alpha$ and $\lambda$. Then
given the x-values, the y-values were generated according to
the following model:

$$y_{hi} = \beta_{0h} + \beta_{1h} x_{hi} + \varepsilon_{hi},$$

where $\varepsilon_{hi} \sim N(0, \sigma_{\varepsilon h}^2)$. The variance $\sigma_{\varepsilon h}^2$ was set such
that the coefficient of correlation (denoted $\rho_{xy}$) between
$x_{hi}$ and $y_{hi}$ is equal to 0.7 in all the populations. Each
population was stratified into three strata, each with 10,000 
units. The parameters of the simulated populations appear in 
Table 2. 

Population 1 fits the ratio model very well with an 
intercept of zero in all strata. Population 2 has a non- 
negligible intercept term in all three strata. Population 3 is a 
mix of populations 1 and 2, where the ratio model fits well 
for strata 2 and 3 but not for stratum 1. Population 4 is 


Survey Methodology, June 2010 


similar to population 1 except units in strata 1 and 2 have a
70% chance of reporting a zero. This population is intended 
to mimic the situation of the Annual Capital Expenditures 
Survey (ACES) of the U.S. Census Bureau, which provided 
the motivation for this research. The ACES employs a 
shortcut jackknife variance estimator that, empirically, has 
been shown to be close to the full jackknife variance 
estimates. Its population is characterized with many zeros 
for capital expenditures in the majority of sampled small 
and medium businesses, with the majority of the reported 
expenditures being provided by large businesses. Population 
5 was generated to show that the shortcut estimator for the 
ratio adjusted estimator can actually have a negative bias 
when the ratio model is misspecified (demonstrated in 
expression (3.3) for a simple random sample). For this 
population, the intercept term is highly significant in all 
strata. 


Table 2
Population parameters

Population   $\beta_0$ (within stratum)   $\beta_1$ (within stratum)   $\alpha$   $\lambda$   CV(x)   CV(y)
             1      2      3              1      2      3
1            0      0      0              2      4      6              4     5     50%     76%
2            120    240    360            2      4      6              4     5     50%     44%
3            120    0      0              2      4      6              4     5     50%     51%
4            0      0      0              2      4      6              4     5     50%     134%
5            50     200    300            0.5    1      2              4     5     200%    63%


From each population, 5,000 stratified simple random 
samples of size 300 (100 units per stratum) were drawn. In 
each sample, nonresponse was generated using a uniform 
response mechanism within each stratum with probabilities 
of response equal to 60% in stratum 1, 70% in stratum 2 and 
90% in stratum 3. This response pattern is not uncommon in 
business surveys where more follow-up is performed for the 
medium and large size units (strata 2 and 3). 
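
The data generation and nonresponse steps can be sketched as follows for population 1. This is an illustration only: the paper does not state whether $\lambda$ is a scale or a rate parameter (it is treated as a scale here), the error variance is derived from the theoretical variance of x, and the seed and function names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_stratum(size, beta0, beta1, alpha, lam, rho=0.7):
    """Generate (x, y) for one stratum with correlation rho between x and y."""
    x = rng.gamma(shape=alpha, scale=lam, size=size)   # lam as scale: assumption
    var_x = alpha * lam ** 2                           # theoretical Gamma variance
    sigma_eps = np.sqrt(beta1 ** 2 * var_x * (1 - rho ** 2) / rho ** 2)
    y = beta0 + beta1 * x + rng.normal(0.0, sigma_eps, size=size)
    return x, y

def draw_sample_with_nonresponse(x, y, n, p_resp):
    """Simple random sample without replacement, then uniform nonresponse."""
    idx = rng.choice(len(y), size=n, replace=False)
    r = rng.random(n) < p_resp
    return x[idx], y[idx], r

# Population 1 of Table 2: zero intercepts, slopes 2, 4 and 6.
slopes = (2.0, 4.0, 6.0)
p_resp = (0.6, 0.7, 0.9)
population = [make_stratum(10_000, 0.0, b1, 4, 5) for b1 in slopes]
samples = [
    draw_sample_with_nonresponse(x, y, 100, p)
    for (x, y), p in zip(population, p_resp)
]
```

Each element of `samples` holds the sampled x-values, the sampled y-values and the response indicators for one stratum; the estimators would use y only where the indicator is true.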

In each sample, both the count adjusted and the ratio 
adjusted estimators, given respectively by (1.4) and (1.5), 
were calculated using the strata as weighting classes. The 
variance of the point estimators was estimated by $v_{JF}$ and
$v_{JS}$, given respectively by (2.2) and (2.3). As a measure of
the bias of a variance estimator $v$, we used the Monte Carlo
percent relative bias given by

$$RB_{MC}(v) = 100 \times \frac{1}{5{,}000} \sum_{r=1}^{5{,}000} \frac{v^{(r)} - MSE_{MC}(\hat{Y}_{CAL})}{MSE_{MC}(\hat{Y}_{CAL})},$$

where $v^{(r)}$ is the variance estimate obtained from the $r$-th
sample, and $MSE_{MC}(\hat{Y}_{CAL})$ is the Monte Carlo Mean
Squared Error (MSE) defined by

$$MSE_{MC}(\hat{Y}_{CAL}) = \frac{1}{50{,}000} \sum_{r=1}^{50{,}000} \left( \hat{Y}_{CAL}^{(r)} - Y \right)^2,$$

where $\hat{Y}_{CAL}^{(r)}$ is the (ratio or count adjusted) estimate of $Y$ for
the $r$-th sample. Table 3 shows the Monte Carlo percent
relative bias for both the count adjusted and the ratio
adjusted estimators.
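
The performance measure can be computed with a small helper such as the one below. It is a sketch under the definitions above; the function name and the toy inputs are assumptions for illustration, and the helper allows the number of variance estimates to differ from the number of point estimates used for the Monte Carlo MSE.

```python
import numpy as np

def mc_percent_relative_bias(var_estimates, point_estimates, true_total):
    """Monte Carlo percent relative bias of a variance estimator."""
    point_estimates = np.asarray(point_estimates, float)
    var_estimates = np.asarray(var_estimates, float)
    mse = np.mean((point_estimates - true_total) ** 2)   # Monte Carlo MSE
    return 100.0 * np.mean((var_estimates - mse) / mse)

# Toy inputs: the Monte Carlo MSE is 2 and the average variance estimate is 6.
rb = mc_percent_relative_bias([5.0, 7.0], [8.0, 12.0, 10.0, 10.0], 10.0)
```

A positive value indicates that the variance estimator overestimates the Monte Carlo MSE on average.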


Table 3
Monte Carlo percent relative bias for the shortcut and full
jackknife variance estimators

Population   Count adjusted estimator               Ratio adjusted estimator
             $RB_{MC}(v_{JS})$   $RB_{MC}(v_{JF})$   $RB_{MC}(v_{JS})$   $RB_{MC}(v_{JF})$
1            57.3%               1.1%                80.5%               -0.3%
2            877.1%              0.4%                364.7%              0.5%
3            220.7%              0.6%                185.9%              -0.2%
4            21.6%               0.6%                29.1%               1.4%
5            266.4%              0.2%                -67.2%              5.0%


As expected, the shortcut estimator overestimates the
Monte Carlo MSE for the count adjusted estimator for all
populations. The overestimation varies from approximately
20% in population 4 to over 800% in population 2. From
expression (3.8), we see that the bias of $v_{JS}$ depends on the
response rate and $\bar{Y}_h$. Population 2 has a large intercept
term which increases $\bar{Y}_h$, and thus decreases $CV_h(y)$, in all
strata, which in turn increases the bias of $v_{JS}$. Population 3 is
similar to population 2 except only the first stratum has a large
intercept term. As expected, the bias of $v_{JS}$ in this
population is between those of populations 1 and 2.
Population 4 is the one generated to mimic the ACES
population with some units' values replaced by zero in strata
1 and 2. The Monte Carlo relative bias of 21.6% is, for the
most part, coming from the third stratum where no units
have been replaced with zero (this can be seen using
expression (3.8)). In comparison, for all five populations the
full jackknife variance estimator tracks the Monte Carlo
MSE very well, with absolute relative biases of at most 1.1%.

Turning to the ratio adjusted estimator, we see that the
full jackknife variance estimator again tracks the Monte
Carlo MSE relatively well for all populations, with absolute
relative biases of at most 5%. The shortcut estimator, on the
other hand, has relative biases varying from -67% to 365%.
Looking at expression (3.7), we see that for a fixed response
rate the bias depends on $CV_h(y)$, $CV_h(x)$ and $\rho_{xyh}$.
Due to the large intercept terms in the second population,
the $\bar{Y}_h$ are large and the corresponding $CV_h(y)$ are smaller
than in the other populations. Thus, the last term in
expression (3.7) is quite large and the resulting relative bias
of $v_{JS}$ is also large. This is also seen for population 3, but
to a lesser extent since only the first stratum has an intercept
term. The opposite effect is seen in population 4, where the
introduction of zeros has significantly increased $CV_h(y)$,
which has in turn reduced the Monte Carlo percent relative
bias of the shortcut estimator.

Additional simulations were performed using some
of the populations described in Table 2 but with varying
response rates. The results are not presented here as they
were as expected. That is, the bias of the shortcut estimator
decreased as the response rate increased (with all the other
parameters remaining fixed). The full jackknife estimator
continued to track the Monte Carlo MSE very well.


5. Conclusion 


In this paper, we evaluated both theoretically and 
empirically a shortcut jackknife variance estimator that does 
not re-calculate the nonresponse adjustment factors within 
each jackknife replicate, specifically considering three 
different nonresponse weighting adjustment procedures. We 
showed in the context of stratified simple random sampling 
that the shortcut jackknife variance estimator tends to 
overestimate the true variance of the estimators. In the 
context of the ratio adjustment procedure, however, the 
shortcut jackknife variance estimator may underestimate the 
true variance if the ratio model is not appropriate for the 
data at hand. 

One justification for the use of a shortcut procedure in a 
replicate variance estimation method is to save time and 
computing resources. If these are truly issues and the 
program has consistently high unit response rates in all 
weighting cells, then while there are clearly theoretical 
advantages to replicating the weight adjustment procedure, 
there may be little or no practical advantage. Having said 
that, the conditions for “practical” equivalence between the 
full and shortcut procedure variance estimators are 
extremely restrictive, and we have demonstrated that small 
changes in underlying data conditions can easily violate 
these conditions. If computational concerns with a full 
jackknife are truly an issue, then the authors recommend the 
linearization jackknife variance estimation approach which 
has the same asymptotic properties as the full jackknife, but 
is computationally quick and computer overhead “free” (in 
terms of replicate storage). See Thompson and Yung (2006) 
for expressions for the linearization jackknife variance 
estimator for both the count and ratio adjusted estimators. 
Given these viable alternatives, we recommend against the 
use of a shortcut procedure variance estimator. 

A comparison of variance estimators for poststratification
to estimated control totals 


Jill A. Dever and Richard Valliant


Abstract 


Calibration techniques, such as poststratification, use auxiliary information to improve the efficiency of survey estimates. 
The control totals, to which sample weights are poststratified (or calibrated), are assumed to be population values. Often, 
however, the controls are estimated from other surveys. Many researchers apply traditional poststratification variance 
estimators to situations where the control totals are estimated, thus assuming that any additional sampling variance 
associated with these controls is negligible. The goal of the research presented here is to evaluate variance estimators for 
stratified, multi-stage designs under estimated-control (EC) poststratification using design-unbiased controls. We compare 
the theoretical and empirical properties of linearization and jackknife variance estimators for a poststratified estimator of a 
population total. Illustrations are given of the effects on variances from different levels of precision in the estimated 
controls. Our research suggests (i) traditional variance estimators can seriously underestimate the theoretical variance, and 
(ii) two EC poststratification variance estimators can mitigate the negative bias.


Key Words: Estimated-control poststratification; Sampling frame coverage bias; Survey-estimated control totals. 


1. Introduction 


Poststratified estimators, and other calibration estimators, 
are used in many types of surveys to reduce variances or to 
correct for frame deficiencies. Specific examples include 
large U.S. government surveys, such as the Consumer 
Expenditure Survey (see, e.g., Jayasuriya and Valliant 1996); 
surveys of specialized populations, such as the U.S. 
Department of Defense Survey of Health Related Behaviors 
among Military Personnel (Bray, Hourani, Rae, Dever, 
Brown, Vincus, Pemberton, Marsden, Faulkner and 
Vandermaas-Peeler 2003); and a myriad of surveys outside 
the U.S. including the Canadian Retail Trade Survey (see, 
e.g., Hidiroglou and Patak 2006), the Swedish Labour Force 
Survey (Mirza and Hörngren 2002), and the British
Household Panel Survey (Taylor, Brice, Buck and Prentice- 
Lane 2007). 

Calibration estimators, such as those generated under 
poststratification, are used to minimize errors associated with 
incomplete sampling frames (i.e., undercoverage) and with 
sampling and nonresponse (see, e.g., Särndal, Swensson and
Wretman 1992; Lessler and Kalsbeek 1992; Kott 2006). For 
example, estimates from the Behavioral Risk Factor 
Surveillance System (BRFSS), a nationwide random-digit- 
dial (RDD) telephone survey conducted by the U.S. Centers 
for Disease Control and Prevention (CDC), are poststratified 
to counts that include households with and without landline 
telephone service (Centers for Disease Control and 
Prevention 2006). The decrease in the errors is linked to the
association of the population control totals with the frame
undercoverage, patterns of non-ignorable nonresponse, and
the variable of interest (Kim, Li and Valliant 2007).

When relevant population controls do not exist, many 
researchers use survey-estimated control totals, and apply 
traditional variance formulae as if the controls were known 
without error. For example, Nadimpalli, Judkins and Chu 
(2004) adjusted weights for the 2003 National Survey of 
Parents and Youth to the number of U.S. households with 
children ages 9-18 estimated from the Current Population 
Survey (CPS) using a ratio-raking algorithm
(www.census.gov/cps). Estimates of how people in the U.S. spend their
time can be calculated from the American Time Use Survey
using weights that have been poststratified to projected 
estimates from the U.S. decennial Census (Killion 2006). 
More recently, researchers at the Pew Research Centers 
calibrated weights for a set of 2008 U.S. presidential pre- 
election surveys to population estimates from the March 
2007 CPS, as well as to estimates on telephone usage 
patterns from the July-December 2007 National Health 
Interview Survey (Keeter, Dimock and Christian 2008). 

The goal of our research is to develop and evaluate 
variance estimators for point estimates with weights that 
contain a poststratification adjustment to a set of survey- 
estimated control totals. We label the methodology which 
properly accounts for the estimated controls as estimated- 
control (EC) poststratification. In this paper, we focus 
specifically on the EC poststratified (ECPS) estimator of a 
population total for data collected from a stratified, multi- 
stage design, where the first-stage sampling units are selected 
with replacement. The remainder of this section gives a brief 
review of weight calibration and poststratification. Section 2 
contains an explicit definition of the ECPS estimator under
study, followed in Section 3 by an evaluation of the bias 
properties. Through a theoretical evaluation (Section 4) and a 
simulation study, we compare variance estimators developed 
for the ECPS estimator with a variance estimator chosen 
under the naive “population control total” assumption. Both 
linearization and replication variance estimators are 
examined in our research. We provide illustrations on the 
effects of different levels of precision in the estimated 
controls on the variance estimates. The specifications for the 
simulation study are detailed in Section 5, followed by a 
summary of the results (Section 6). We conclude the paper 
with a brief summary and an overview of future research in 
this area. 

Calibration estimators (Deville and Särndal 1992), such
as a poststratified estimator of a population total, borrow 
strength from auxiliary information to improve the effi- 
ciency of survey estimates over simpler weighting methods. 
When the auxiliary variables are (linearly) related to the set 
of key survey variables, calibration estimators can be very 
efficient. 

The general form of a traditional or fixed-control
calibration estimator is best described as an expansion
estimator or “linear weighting” estimator as discussed in
Estevao and Särndal (2000). Define $s$ to be the set of sample
elements from a probability sample, and $d_k = 1/\pi_k$ to be
the design weight for element $k$ such that $\pi_k = \Pr(k \in s)$.
An estimated population total of a variable $y$ is
$\hat{t}_y = \sum_{k \in s} w_k y_k$, where the calibration weight
($w_k = a_k d_k$) for the $k$th element is defined as a function of
the design weight, $d_k$, and a calibration-adjustment factor, $a_k$,
also known as a g-weight (Särndal et al. 1992). The calibration
weights are calculated by minimizing a specified function that
measures the distance between the design and calibration weights
subject to a set of constraints defined as:


$$\hat{\mathbf{t}}_{x} = \mathbf{t}_{x} \qquad (1)$$


where $\mathbf{t}_{x} = \sum_{k \in U} \mathbf{x}_k$, the vector of population controls
(counts) corresponding to the $G$ ($G \ge 1$) auxiliary
variables; $\hat{\mathbf{t}}_{x} = \sum_{k \in s} w_k \mathbf{x}_k$, the estimated population
controls corresponding to the components of $\mathbf{t}_{x}$; and $\mathbf{x}_k$ is
a vector of length $G$ containing auxiliary or benchmark
variable values for element $k$. Note that $\mathbf{x}_k$ may contain
ones and zeros to indicate the presence or absence of a
certain characteristic (e.g., age 18-25), or larger values (e.g.,
number of children). An example of such a calibration
system is the generalized least squares (or chi-square)
distance function $\sum_{k \in s} (w_k - d_k)^2 / (c_k d_k)$ that is minimized
subject to the constraints in (1). This system generates a
closed-form solution called the generalized regression
estimator (GREG) for $c_k = 1$ (Deville and Särndal 1992).
The poststratified estimator is a special case of the GREG.
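
The closed-form solution mentioned above can be illustrated with a minimal sketch. It assumes the chi-square distance with $c_k = 1$, for which the calibrated weights take the form $w_k = d_k (1 + \mathbf{x}_k' \boldsymbol{\lambda})$; the function name `chi_square_calibration` and the five-element sample are hypothetical and are not taken from the paper. With poststratum indicators as the auxiliary variables, the weights reduce to the familiar poststratified weights $d_k N_g / \hat{N}_g$.

```python
import numpy as np

def chi_square_calibration(d, X, t_x):
    """Weights minimizing sum_k (w_k - d_k)^2 / d_k subject to X'w = t_x (c_k = 1)."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    t_x = np.asarray(t_x, dtype=float)
    t_hat = X.T @ d                        # design-weighted estimates of the controls
    M = X.T @ (d[:, None] * X)             # sum_k d_k x_k x_k'
    lam = np.linalg.solve(M, t_x - t_hat)  # Lagrange multipliers
    a = 1.0 + X @ lam                      # calibration-adjustment factors (g-weights)
    return a * d

# Hypothetical sample: five elements in two poststrata coded as indicator columns.
d = np.array([10.0, 10.0, 20.0, 20.0, 20.0])
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
t_x = np.array([30.0, 50.0])               # assumed control totals

w = chi_square_calibration(d, X, t_x)
print(w)          # calibrated weights
print(X.T @ w)    # reproduces the control totals
```

In this illustration the first poststratum has an estimated count of 20 against a control of 30, so its weights are scaled by 30/20; the second is scaled by 50/60.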
Variance estimation techniques for the poststratified
estimator, and more generally for the GREG, have been
widely studied. Binder (1995) demonstrates techniques used
to calculate a Taylor linearization variance estimator for the
GREG. Additional references for the linearization variance
estimator under poststratification (and calibration more
generally) include Deville, Särndal and Sautory (1993),
Demnati and Rao (2004), and Hidiroglou and Patak (2006).
Särndal, Swensson and Wretman (1989) developed an
approximate linearization variance for the GREG of a
population total as a function of the population residuals
from a specified model and the design weights ($d_k$).
Valliant (1993) and Yung and Rao (1996) modified the
residual-based variance estimator by multiplying the sample
residuals by the calibration weights $w_k (= a_k d_k)$. They
demonstrated that this revised estimator, created by
linearizing the associated jackknife, reduced the bias
associated with the original formula. This variance estimator is
also discussed in Särndal et al. (1992), Stukel, Hidiroglou
and Särndal (1996), and in Chapter 11 of Särndal and
Lundström (2005). Properties of replication variance
estimators (i.e., jackknife and BRR) have been examined in,
for example, Valliant (1993), Rust and Rao (1996), Canty
and Davison (1999), Théberge (1999), Rao and Shao
(1999), Yung and Rao (1996; 2000), and Kott (2006).

An assumption in the articles above is that the control 
totals, to which the auxiliary sample estimates are adjusted, 
are either true population values known without error, or are 
taken from an independent, highly precise survey that is 
much larger than the survey requiring calibration. In some 
cases, however, these controls are estimated from other 
surveys with non-negligible sampling variances. For 
example, there are efforts to calibrate Web panel surveys to 
separate, higher-quality reference surveys that are not much 
larger than the panel surveys themselves (e.g., Krotki 2007; 
Terhanian, Bremer, Smith and Thomas 2000). 

Many researchers apply formulae developed for tradi- 
tional poststratification even though the controls have been 
estimated. The tacit assumption is that any additional error 
(variance and bias) associated with these controls is 
negligible and can be ignored. Currently, the validity of this
assumption cannot be checked until a complete picture of
EC poststratification has been developed.


2. The estimated-control poststratified estimator 


To facilitate our discussion of the estimated-control post- 
stratified estimator, we label the survey requiring post- 
stratification as the analytic survey and the source of the 
control totals as the benchmark survey. In practice, more 
than one benchmark survey may be tapped for the control 
totals. However, we will assume only one benchmark 
survey for the theoretical development so that control total
variances and covariances are estimable. 

Let $U$ represent the finite target population containing $N$
elements and $t_y = \sum_{k \in U} y_k$ represent the population total of
interest for a variable $y$. Let $s_A$ represent a random sample
of size $n_A$ from the frame $U_A$ for the analytic survey. A
random sample $s_B$ of size $n_B$ is selected for the benchmark
survey from the corresponding sampling frame $U_B$. We
allow the possibility that each of the frames, $U_A$ and $U_B$,
does not completely cover the target population $U$. However,
coverage is treated as a random event so that all elements in
the target population have a positive probability of being
covered by either the analytic or the benchmark survey
frame.

As a convention throughout the paper, an “A” subscript 
signifies an association with the analytic survey such as a 
sample design parameter or an estimate. A “B” subscript 
identifies the benchmark survey quantities. These subscripts 
are absent from the parameters associated with the
population of interest, i.e., $t_y$.

For the stratified, multi-stage design assumed for the
analytic survey, $m_{Ah}$ ($m_{Ah} \ge 2$) primary sampling units
(PSUs), indexed by $i$, are selected with replacement from a
total of $M_{Ah}$ PSUs in the $h$th design stratum ($h = 1, \ldots, H$
with $H \ge 2$). We assume that $n_{Ahi}$ elements, each indexed
by $k$, are selected from $N_{Ahi}$ in PSU $hi$ in such a way that
an unbiased estimate of the PSU total can be made. The
design weight, $d_k$, is calculated as the inverse of the
unconditional inclusion probability for $k \in s_{Ahi}$, the set of
analytic survey elements within the $hi$th PSU. Thus, $n_A$,
the size of the analytic survey sample, is calculated as
$n_A = \sum_{h=1}^{H} \sum_{i=1}^{m_{Ah}} n_{Ahi}$. Elements for the benchmark survey are
randomly drawn from the corresponding sampling frame;
no explicit specifications are made for the random sampling
method.

Poststratification can be used to correct for sampling and
coverage errors. Therefore, we allow undercoverage in both
the analytic-survey and the benchmark-survey sampling
frames. We do not, however, consider the effects of
nonresponse.

Suppose that the population $U$ can be divided into
$g = 1, \ldots, G$ mutually exclusive and exhaustive poststrata.
When the population count of elements, $N_g$, is known for
each poststratum, the traditional poststratified estimator of a
total for $y$ is defined as

$$\hat{t}_{yPS} = \sum_{g=1}^{G} N_g \frac{\hat{t}_{Ayg}}{\hat{N}_{Ag}} \qquad (2)$$

where $y_k$ is the value of the analysis variable $y$ for element
$k$; $\hat{t}_{Ayg} = \sum_{k \in s_A} \delta_{gk} d_k y_k$, the total of $y$ in poststratum $g$
estimated from the analytic survey data; $\hat{N}_{Ag} = \sum_{k \in s_A} \delta_{gk} d_k$,
the analytic survey estimated total in poststratum $g$; and
$\delta_{gk} = 1$ indicates membership in the $g$th poststratum and
zero otherwise. Note that $\hat{t}_{Ayg}$ may also be expressed as
$\hat{t}_{Ayg} = \sum_{k \in s_{Ag}} d_k y_k$, where $s_{Ag}$ indicates the set of analytic
survey elements in poststratum $g$. The “hat” notation in the
expression above is used to distinguish a population
estimator (e.g., $\hat{N}_{Ag}$) from the known population parameter
(e.g., $N_g$). If the count of elements in poststratum $g$ is
estimated by setting $y_k = 1$ for the elements of that poststratum
(and zero otherwise) in the formula for $\hat{t}_{yPS}$, then
$\hat{t}_{yPS}$ equals $N_g$. In this sense, $\hat{t}_{yPS}$ is poststratified to the
population counts $N_1, \ldots, N_G$.

In certain situations, however, the population counts are
not available and must be estimated from a benchmark
survey. Define the ECPS estimator of a population total of a
variable $y$ as

$$\hat{t}_{yP} = \sum_{g=1}^{G} \hat{N}_{Bg} \frac{\hat{t}_{Ayg}}{\hat{N}_{Ag}}. \qquad (3)$$

The number of population elements in the $g$th
poststratum ($g = 1, \ldots, G$) estimated from the benchmark
survey is denoted as $\hat{N}_{Bg} = \sum_{l \in s_{Bg}} w_l$, where $s_{Bg}$ is the set
of sample elements in poststratum $g$ from the benchmark
survey and $w_l$ is the weight associated with the $l$th
element. The calibration-adjustment factors applied to the
analytic survey design weights for $\hat{t}_{yP}$ are calculated as
$a_k = \hat{N}_{Bg} / \hat{N}_{Ag}$ for $k \in s_{Ag}$.

Relating the poststratified estimators to the calibration
system discussed in the previous section, $\hat{\mathbf{t}}_{x}$ is a $G$-length
vector of estimated population counts for each poststratum
such that $\hat{\mathbf{t}}_{x} = (\hat{N}_{A1}, \ldots, \hat{N}_{AG})'$ where
$\hat{N}_{Ag} = \sum_{k \in s_A} d_k \delta_{gk}$ and $x_{gk} = \delta_{gk} = 1$ if the element $k$ is a
member of the $g$th poststratum and 0 otherwise. The vector
$\mathbf{t}_{x}$ corresponds either to $\mathbf{N} = (N_1, \ldots, N_G)'$ for the $\hat{t}_{yPS}$
estimator given in (2), or to $\hat{\mathbf{N}}_B = (\hat{N}_{B1}, \ldots, \hat{N}_{BG})'$, a
$G \times 1$ vector of benchmark control estimates, for the $\hat{t}_{yP}$
estimator given in (3).

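The estimator $\hat{t}_{yP}$ can be expressed in matrix notation as
$\hat{t}_{yP} = \hat{\mathbf{N}}_B' \hat{\bar{\mathbf{Y}}}_A$, where $\hat{\bar{\mathbf{Y}}}_A = \hat{\mathbf{N}}_A^{-1} \hat{\mathbf{t}}_{Ay}$, a $G \times 1$ vector of
estimated poststratum means of $y$ from the analytic survey;
$\hat{\mathbf{N}}_A = \mathrm{diag}(\hat{N}_{A1}, \ldots, \hat{N}_{AG})$, a diagonal matrix
of poststratum totals estimated from the analytic survey; and
$\hat{\mathbf{t}}_{Ay} = (\hat{t}_{Ay1}, \ldots, \hat{t}_{AyG})'$ is a $G \times 1$ vector of poststratum
totals for the outcome variable estimated from the analytic
survey. The remaining variables associated with the matrix
notation were defined previously.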
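
As a small numerical illustration of expression (3) and its matrix form, the sketch below computes the poststratum means from the analytic survey and weights them by a set of controls. The function name `poststratified_total` and all data values are hypothetical; poststrata are assumed to be coded 0, ..., G-1 and to be non-empty in the analytic sample.

```python
import numpy as np

def poststratified_total(d, y, post, controls):
    """Return sum_g control_g * (t_hat_Ayg / N_hat_Ag) and the vector of means."""
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    post = np.asarray(post)
    controls = np.asarray(controls, dtype=float)
    G = controls.size
    N_A_hat = np.bincount(post, weights=d, minlength=G)        # estimated counts
    t_Ay_hat = np.bincount(post, weights=d * y, minlength=G)   # estimated y totals
    y_bar_hat = t_Ay_hat / N_A_hat                             # poststratum means
    return float(controls @ y_bar_hat), y_bar_hat

# Hypothetical analytic survey data (design weights, outcome, poststratum code).
d = np.array([10.0, 10.0, 20.0, 20.0, 20.0])
y = np.array([1.0, 3.0, 2.0, 4.0, 6.0])
post = np.array([0, 0, 1, 1, 1])
N_B_hat = np.array([30.0, 50.0])   # assumed benchmark-survey control estimates

t_yP, y_bar_hat = poststratified_total(d, y, post, N_B_hat)
print(t_yP, y_bar_hat)
```

Passing known population counts instead of `N_B_hat` gives the traditional estimator in (2); setting every outcome to one returns the sum of the controls.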

An effective poststratification adjustment can reduce the 
bias in the resulting point estimates and will either reduce or 
minimally inflate the variance in comparison to the 
unadjusted weight. This effect is well known for traditional 
poststratification; we provide the comparative evaluation 
under an estimated-control setting in the next sections. 
3. Bias in the ECPS of a population total


Traditional poststratification is known for reducing the 
bias associated with an incomplete sampling frame. This 
reduction is most successful when poststrata are formed 
such that the within-poststratum correlation of $y_k$ with the
probability of the $k$th element being included on the
sampling frame is very near zero (Kim, Li and Valliant 2007).

To evaluate the (unconditional) design-based bias for
$\hat{t}_{yP}$, we must account for the random property of four
components: the analytic and benchmark sample designs
and the population coverage propensities for the
corresponding sampling frames. Following the work of
Kim, Li and Valliant (2007, equation 2), the approximate
bias of $\hat{t}_{yP}$ as an estimator of the population total
$t_y = \sum_{k \in U} y_k$ is calculated as

$$\mathrm{Bias}(\hat{t}_{yP}) = E(\hat{t}_{yP} - t_y) \cong \sum_{g=1}^{G} \left[ t_{yg} \left( \frac{N_{Bg}}{N_g} - 1 \right) + N_{Bg}\, \mathrm{Cov}(y_g, \phi_g)\, \bar{\phi}_g^{-1} \right] \qquad (4)$$

where $N_g$ is the population size for the set of elements $U_g$
within poststratum $g$; $N_{Bg} = E(\hat{N}_{Bg})$, the expected value
of the poststratum estimates under the benchmark survey
design; $\mathrm{Cov}(y_g, \phi_g) = N_g^{-1} \sum_{k \in U_g} (y_k - \bar{y}_g)(\phi_k - \bar{\phi}_g)$,
the population covariance between the outcome variable
($y_k$) and the coverage propensities ($\phi_k$) within poststratum
$g$; $\bar{y}_g = t_{yg} / N_g$, the $g$th poststratum mean of
$y$; $t_{yg} = \sum_{k \in U_g} y_k$, the population total of $y$ within
poststratum $g$; and $\bar{\phi}_g = N_{Ag} / N_g$, the average coverage
propensity within the poststratum under the analytic survey
design with $N_{Ag} = E(\hat{N}_{Ag})$. Note that the population total
may also be expressed as $t_y = \sum_{g=1}^{G} t_{yg}$.

Components of the bias are zero only under certain
conditions. (i) If $N_{Bg} = N_g$ for all $g$ (i.e., no coverage
errors in the benchmark sampling frame), then the bias is
dependent only on the association between the outcome
variable and the coverage propensities, $\mathrm{Cov}(y_g, \phi_g)$. The
value of $\mathrm{Bias}(\hat{t}_{yP})$ then reduces to the expression provided in
Kim, Li and Valliant (2007, equation 2) for the traditional
poststratified estimator, $\hat{t}_{yPS}$. (ii) If the coverage
probabilities are constant within each poststratum (i.e., $\phi_k = \bar{\phi}_g$ for
$k \in U_g$ for all $g$), then the second bias component is zero.
Only if both conditions are satisfied can we say that $\hat{t}_{yP}$ is
approximately unbiased. Some may argue that a “perfect”
combination of poststrata could be formed such that the
positive and negative components cancel; however, we
believe this likelihood to be so rare as to be virtually
impossible.
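
The two bias components can be explored numerically. The sketch below assumes that the approximation has the two-part form $\sum_g [t_{yg}(N_{Bg}/N_g - 1) + N_{Bg} \mathrm{Cov}(y_g, \phi_g)/\bar{\phi}_g]$, which matches the two conditions described above; the function name `ecps_approx_bias` and the four-element population are hypothetical.

```python
import numpy as np

def ecps_approx_bias(y, phi, post, N_B):
    """Approximate bias split into a control-total part and a coverage part."""
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)     # coverage propensities, analytic frame
    post = np.asarray(post)
    N_B = np.asarray(N_B, dtype=float)     # expected benchmark controls
    G = N_B.size
    N_g = np.bincount(post, minlength=G).astype(float)
    t_yg = np.bincount(post, weights=y, minlength=G)
    y_bar = t_yg / N_g
    phi_bar = np.bincount(post, weights=phi, minlength=G) / N_g
    cross = (y - y_bar[post]) * (phi - phi_bar[post])
    cov = np.bincount(post, weights=cross, minlength=G) / N_g
    control_part = t_yg * (N_B / N_g - 1.0)
    coverage_part = N_B * cov / phi_bar
    return control_part.sum() + coverage_part.sum(), control_part, coverage_part

# Hypothetical population of four elements in two poststrata.
y = np.array([1.0, 3.0, 2.0, 6.0])
post = np.array([0, 0, 1, 1])

# Both conditions hold: unbiased controls and constant coverage within poststrata.
bias_none, _, _ = ecps_approx_bias(y, [0.8, 0.8, 0.5, 0.5], post, [2.0, 2.0])

# Coverage varies with y inside poststratum 0, so the second component is nonzero.
bias_cov, _, cov_part = ecps_approx_bias(y, [0.5, 1.0, 0.5, 0.5], post, [2.0, 2.0])
print(bias_none, bias_cov)
```

In the second call, the covered elements of poststratum 0 over-represent the larger value of y, which raises the estimated poststratum mean and produces a positive bias.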

Having examined bias, we present an evaluation of the
variance of $\hat{t}_{yP}$. For some estimators, the contribution of the
bias (squared) to the total mean square error (MSE) is small
relative to the variance.
4. Variance estimation for the ECPS


Variance estimators have been developed for traditional 
poststratification and are available in software designed to 
analyze survey data, e.g., R (R Development Core Team
2009), SAS® (SAS Institute Inc. 2009), Stata® (StataCorp
2010), and SUDAAN® (Research Triangle Institute 2008).
However, limited work has been completed on variance 
estimation for EC poststratification. 

Four EC variance estimators for $\hat{t}_{yP}$ that account for the
variance in the control totals are presented in the following
subsections after defining the population sampling variance.
They include one newly developed linearization variance
estimator, and three delete-one-PSU (delete-one) jackknife
variance estimators. With the delete-one jackknife,
replicates are created by sequentially deleting one PSU and
adjusting the weights for the remaining PSUs within the
corresponding design stratum. This results in a total of
$m_A = \sum_{h=1}^{H} m_{Ah}$ replicates calculated by summing the
number of analytic-survey PSUs per stratum ($m_{Ah}$) across the $H$
strata ($h = 1, \ldots, H$).

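An effective variance estimator will reproduce the
corresponding population sampling variance in expectation.
The approximate (or asymptotic) population sampling
variance of $\hat{t}_{yP} = \hat{\mathbf{N}}_B' \hat{\bar{\mathbf{Y}}}_A$ has the following form:

$$\begin{aligned}
AV(\hat{t}_{yP}) &= \mathbf{N}_B' \mathbf{V}_A \mathbf{N}_B + 2 \bar{\mathbf{Y}}_A' \mathrm{Cov}(\hat{\mathbf{N}}_B, \hat{\bar{\mathbf{Y}}}_A) \mathbf{N}_B + \bar{\mathbf{Y}}_A' \mathbf{V}_B \bar{\mathbf{Y}}_A \\
&= \mathbf{N}_B' \mathbf{V}_A \mathbf{N}_B + \bar{\mathbf{Y}}_A' \mathbf{V}_B \bar{\mathbf{Y}}_A \qquad (5)
\end{aligned}$$

where $\mathbf{N}_B = E(\hat{\mathbf{N}}_B)$, a vector of expected values for the
benchmark poststratum counts within the $G$ poststrata;
$\hat{\mathbf{N}}_B = (\hat{N}_{B1}, \ldots, \hat{N}_{BG})'$ is a $G$-length vector of control totals
estimated from the benchmark survey; $\bar{\mathbf{Y}}_A$ is a $G$-length
vector with population components of the form
$\bar{Y}_{Ag} = t_{Ayg} / N_{Ag}$; $\mathbf{V}_A$ is the population (variance-)covariance matrix
of the estimated components of the vector $\hat{\bar{\mathbf{Y}}}_A$; and $\mathbf{V}_B$ is
the covariance matrix of the $G$ benchmark control
estimates $\hat{\mathbf{N}}_B$. The first component, $\mathbf{N}_B' \mathbf{V}_A \mathbf{N}_B$, is the
approximate variance for the traditional poststratified
estimator $\hat{t}_{yPS}$, i.e., the benchmark estimates are treated as
fixed. The component, $\bar{\mathbf{Y}}_A' \mathbf{V}_B \bar{\mathbf{Y}}_A$, is the variance associated
with the benchmark estimates conditioned on the analytic
survey sample; this is the EC poststratification variance
component. Because we assume that the analytic and
benchmark surveys are independent, the covariance of
estimates from the two surveys is, by definition, zero.
Hence, the component $\mathrm{Cov}(\hat{\mathbf{N}}_B, \hat{\bar{\mathbf{Y}}}_A)$ above is eliminated
from the expression.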

Krewski and Rao (1981), Rao and Wu (1985), and others 
demonstrated the asymptotic consistency of the linearization 
and jackknife variance estimators for nonlinear functions. 
However, this examination needs to be extended to the EC
poststratification. We discuss below the set of EC variance
estimators for the population sampling variance that were
identified or developed for our research. The sample
estimators were calculated by substituting sample estimates 
for the corresponding variance parameters. We begin with 
an evaluation of a traditional or naive poststratified variance 
estimator that does not account for the variation in the 
estimated controls. 


4.1 A traditional variance estimator for EC 
poststratification (Naive) 


A variety of variance estimators have been developed for
poststratification estimators. With all of the methods, the
controls are assumed to be fixed and known without error.
Therefore, $\bar{\mathbf{Y}}_A' \mathbf{V}_B \bar{\mathbf{Y}}_A$, the second (positive) component in
expression (5), is zero because $\mathbf{V}_B = \mathbf{0}$ by assumption. The
linearization variance estimator has the form

$$\mathrm{var}_{Naive}(\hat{t}_{yP}) = \hat{\mathbf{N}}_B' \hat{\mathbf{V}}_A \hat{\mathbf{N}}_B \qquad (6)$$

where $\hat{\mathbf{N}}_B$ is the vector of the $G$ benchmark control total
estimates, and $\hat{\mathbf{V}}_A$ is the estimated covariance matrix of the
estimates $\hat{\bar{\mathbf{Y}}}_A = (\hat{t}_{Ay1}/\hat{N}_{A1}, \ldots, \hat{t}_{AyG}/\hat{N}_{AG})'$. Because the
second component in the second line of (5) is not estimated,
any variance formula developed for traditional
poststratification will by definition underestimate the population
sampling variance. However, highly precise benchmark
estimates may contribute a negligible EC-poststratification
variance component to the overall estimate. Thus, the
difference between the estimates for traditional and EC
poststratification will for these situations also be negligible.


4.2 Taylor series linearization (ECTS) 


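A linearization variance estimator for $\hat{t}_{yP}$ has the
form:

$$\mathrm{var}_{ECTS}(\hat{t}_{yP}) = \hat{\mathbf{N}}_B' \hat{\mathbf{V}}_A \hat{\mathbf{N}}_B + \hat{\bar{\mathbf{Y}}}_A' \hat{\mathbf{V}}_B \hat{\bar{\mathbf{Y}}}_A \qquad (7)$$

where $\hat{\mathbf{V}}_B$ is the estimated benchmark covariance matrix for
the set of $G$ control totals. The remaining terms are defined
for expression (6). The ECTS formula is a function of the
variance under traditional poststratification and an additive
inflation term associated with the variation in the
benchmark controls, i.e.,
$\mathrm{var}_{ECTS}(\hat{t}_{yP}) = \mathrm{var}_{Naive}(\hat{t}_{yP}) + \hat{\bar{\mathbf{Y}}}_A' \hat{\mathbf{V}}_B \hat{\bar{\mathbf{Y}}}_A$.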

Ideally, the benchmark survey analysis file would be
available to calculate the values for $\hat{\mathbf{V}}_B$. However,
researchers may have to rely on published estimates for only
the marginal control totals, i.e., point and variance estimates 
by one characteristic instead of the counts and covariance 
estimates for a set of characteristics. The implications of 
having limited information are discussed further in 
Section 4.4. 
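
The relationship between the naive and ECTS estimators amounts to two quadratic forms. The following minimal sketch evaluates both for made-up inputs; the function names and all numerical values are hypothetical, and the covariance matrices are assumed to have been estimated elsewhere.

```python
import numpy as np

def var_naive(N_B_hat, V_A_hat):
    """Linearization variance treating the controls as fixed: N_B' V_A N_B."""
    return float(N_B_hat @ V_A_hat @ N_B_hat)

def var_ects(N_B_hat, V_A_hat, Y_A_hat, V_B_hat):
    """Naive variance plus the inflation term Y_A' V_B Y_A for estimated controls."""
    return var_naive(N_B_hat, V_A_hat) + float(Y_A_hat @ V_B_hat @ Y_A_hat)

# Hypothetical inputs for G = 2 poststrata.
N_B_hat = np.array([30.0, 50.0])                  # benchmark control estimates
Y_A_hat = np.array([2.0, 4.0])                    # estimated poststratum means
V_A_hat = np.array([[0.04, 0.01], [0.01, 0.09]])  # covariance of the means
V_B_hat = np.array([[4.0, -1.0], [-1.0, 9.0]])    # covariance of the controls

v_naive = var_naive(N_B_hat, V_A_hat)
v_ects = var_ects(N_B_hat, V_A_hat, Y_A_hat, V_B_hat)
print(v_naive, v_ects)
```

Because the inflation term is a quadratic form in a covariance matrix, it cannot be negative, so the ECTS estimate is never smaller than the naive one.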
4.3 Fuller two-phase jackknife method (ECF2)


Isaki, Tsay and Fuller (2004) applied a two-phase
delete-one jackknife variance estimator developed by Fuller (1998)
to an EC poststratification situation. The premise behind
Fuller’s methodology (ECF2) is to take a spectral
(eigenvalue) decomposition of the benchmark covariance
matrix ($\hat{\mathbf{V}}_B$), develop benchmark adjustments that are a
function of the resulting eigenvalues and eigenvectors, and
add the adjustments to the vector of benchmark controls
($\hat{\mathbf{N}}_B$) to create a set of replicate controls. A randomly
chosen subset of the $m_A$ replicates is poststratified to the $G$
constructed replicate controls where the total number of
PSUs must equal or exceed the number of poststrata, i.e.,
$m_A \ge G$. Specifically, the benchmark control total for the
$r$th replicate is defined as

$$\hat{\mathbf{N}}_{B(r)} = \hat{\mathbf{N}}_B + c_h \hat{\mathbf{z}}_{(r)} \qquad (8)$$

where $\hat{\mathbf{z}}_{(r)} = \delta_{(r)} \sum_{g=1}^{G} \delta_{g|(r)} \hat{\mathbf{z}}_g$; $c_h = \sqrt{m_{Ah}/(m_{Ah} - 1)}$, a
constant related to the delete-one jackknife variance
method; $\delta_{(r)}$ is a zero/one indicator that identifies the $G$
(out of $m_A$) randomly chosen replicates to receive an
adjustment; $\delta_{g|(r)} = 1$ if the $g$th component of the
benchmark covariance decomposition is randomly chosen
for the assignment given that replicate $r$ is selected for
adjustment; and $\hat{\mathbf{z}}_g = \hat{\mathbf{q}}_g \hat{\lambda}_g^{1/2}$, a function of an eigenvector
($\hat{\mathbf{q}}_g$) and the associated eigenvalue ($\hat{\lambda}_g$) where
$\hat{\mathbf{V}}_B = \sum_{g=1}^{G} \hat{\mathbf{z}}_g \hat{\mathbf{z}}_g'$, by definition. Thus, given that $\delta_{(r)} = 1$
for a particular replicate, a single indicator $\delta_{g|(r)}$ must also
equal one; however, if $\delta_{(r)} = 0$, then all indicators $\delta_{g|(r)}$
equal zero.

The delete-one jackknife can take multiple forms 
depending on the centering value. We chose the somewhat 
conservative variance estimator centered about the full- 
sample estimate for our research (v, in Wolter 2007, 
section 4.5). The delete-one jackknife variance estimator, 
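$\mathrm{var}_{ECF2}(\hat{t}_{yP})$, is calculated as follows under the Fuller
method for a stratified, multi-stage design.

$$\begin{aligned}
\mathrm{var}_{ECF2}(\hat{t}_{yP}) &= \sum_{h=1}^{H} \frac{m_{Ah} - 1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} \left( \hat{t}_{yP(r)} - \hat{t}_{yP} \right)^2 \\
&= \sum_{h=1}^{H} \frac{m_{Ah} - 1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} \left( \hat{t}_{yPS(r)} - \hat{t}_{yP} + c_h \hat{\mathbf{z}}_{(r)}' \hat{\bar{\mathbf{Y}}}_{A(r)} \right)^2 \qquad (9)
\end{aligned}$$

where the terms in (9) are defined below. Note that the
association of the $r$th replicate to a particular design stratum
is defined through the stratum membership of the eliminated
PSU. The replicate estimates in (9) are defined as
$\hat{t}_{Ayg(r)} = \sum_{h} \sum_{i \in s_{Ah}} a_{i(r)} \sum_{k \in s_{Ahi}} \delta_{gk} d_k y_k$ and
$\hat{N}_{Ag(r)} = \sum_{h} \sum_{i \in s_{Ah}} a_{i(r)} \sum_{k \in s_{Ahi}} \delta_{gk} d_k$, where the PSU-subsampling
weights are calculated as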



$$a_{i(r)} = \begin{cases}
0 & \text{if } i = r,\ r \in s_{Ah}, \\
1 & \text{if } h \ne h' \text{ for } r \in s_{Ah} \text{ and } i \in s_{Ah'}, \\
m_{Ah}/(m_{Ah} - 1) & \text{if } r \ne i \text{ but } h = h'.
\end{cases} \qquad (10)$$

The remaining terms in (9) are $\hat{\bar{Y}}_{Ag(r)} = \hat{t}_{Ayg(r)} / \hat{N}_{Ag(r)}$, the
estimated mean of the outcome variable within poststratum
$g$ and replicate $r$;

$$\hat{t}_{yP(r)} = \sum_{g=1}^{G} \hat{N}_{Bg(r)} \frac{\hat{t}_{Ayg(r)}}{\hat{N}_{Ag(r)}}, \qquad (11)$$

a function of replicate estimates with $\hat{N}_{Bg(r)}$ defined as the
$g$th component in expression (8); $\hat{t}_{yPS(r)}$ is the replicate
estimate under traditional poststratification, namely
$\sum_{g=1}^{G} \hat{N}_{Bg} (\hat{t}_{Ayg(r)} / \hat{N}_{Ag(r)})$; and $\hat{t}_{yP}$ is the estimated total
given in expression (3) calculated from the complete sample
file. Squaring the terms in (9) results in a variance
component conditioned on the benchmark controls, a
component due to the benchmark control variability, and a
cross-term of lower order that is approximately equal to zero
in expectation. The design-expectation of the resulting
jackknife variance estimator is asymptotically equivalent to
$AV(\hat{t}_{yP})$ in (5) only if the respective components are
calculated with values from design-consistent estimators.
Fuller (1998) also demonstrated that the jackknife variance
of the replicate controls, $\mathrm{var}_{ECF2}(\hat{\mathbf{N}}_B)$, reproduces the
estimated benchmark covariance matrix $\hat{\mathbf{V}}_B$ for every
sample.
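
The PSU-level factors in expression (10) can be tabulated once and reused for every replicate estimate. The sketch below is one possible way to build them, assuming each PSU carries a stratum label and every stratum has at least two PSUs; the function name `jackknife_factors` and the small design are hypothetical.

```python
import numpy as np

def jackknife_factors(stratum):
    """Matrix A with A[r, i] = a_i(r): factor for PSU i when PSU r is deleted."""
    stratum = np.asarray(stratum)
    _, inverse, counts = np.unique(stratum, return_inverse=True, return_counts=True)
    m_h = counts[inverse].astype(float)      # number of PSUs in each PSU's stratum
    if np.any(m_h < 2):
        raise ValueError("every stratum needs at least two PSUs")
    same = stratum[:, None] == stratum[None, :]          # same[r, i]: shared stratum
    A = np.where(same, (m_h / (m_h - 1.0))[:, None], 1.0)
    np.fill_diagonal(A, 0.0)                 # the deleted PSU receives weight zero
    return A

# Hypothetical design: two PSUs in stratum 0 and three PSUs in stratum 1.
stratum = np.array([0, 0, 1, 1, 1])
A = jackknife_factors(stratum)

# Replicate totals from weighted PSU-level totals (illustrative values only).
psu_totals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
replicate_totals = A @ psu_totals
print(A)
print(replicate_totals)
```

Applying the same matrix to PSU-level sums of $\delta_{gk} d_k y_k$ and $\delta_{gk} d_k$ gives the replicate numerators and denominators of the poststratum means.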

Currently no software exists to calculate the ECF2. The
six steps needed to calculate $\mathrm{var}_{ECF2}(\hat{t}_{yP})$ using any
appropriate programmable package are as follows:


1. Calculate the full-sample estimate $\hat{t}_{yP}$ using
expression (3).

2. Determine the $G$ eigenvalues $\hat{\lambda}_g$ and eigenvectors
$\hat{\mathbf{q}}_g$ for $\hat{\mathbf{V}}_B$, and calculate the replicate adjustments
$\hat{\mathbf{z}}_g = \hat{\mathbf{q}}_g \hat{\lambda}_g^{1/2}$. Concatenate the $G \times G$ matrix of
$\hat{\mathbf{z}}_g$'s with a $G \times (m_A - G)$ matrix of zeros, and
randomly sort the columns. Call this new $G \times m_A$
matrix $\mathbf{Z}$.

3. Calculate a vector of length $m_A$ with values equal to
$c_h = \sqrt{m_{Ah}/(m_{Ah} - 1)}$ ordering from $h = 1$ to $H$.
Populate each row of a $G \times m_A$ matrix, called $\mathbf{C}$,
with this vector, i.e., the row values are repeated. The
$m_A$-length vector of jackknife stratum weights, $\mathbf{W}_A$,
is created with components equal to $(m_{Ah} - 1)/m_{Ah}$
where the deleted PSU is extracted from stratum $h$.

4. Calculate the Hadamard (or element-wise) product
(Searle 1982, page 49) of $\mathbf{Z}$ and $\mathbf{C}$ denoted as
$\mathbf{Z} \circ \mathbf{C}$. Replicate the vector $\hat{\mathbf{N}}_B$ into the columns of
a $G \times m_A$ matrix and add to $\mathbf{Z} \circ \mathbf{C}$. This new
$G \times m_A$ matrix, called $\hat{\mathbf{N}}_{BR}$, contains the replicate
benchmark controls discussed in expression (8) for all
$m_A$ replicates.

5. Calculate the replicate estimates $\hat{\bar{Y}}_{Ag(r)} =
\hat{t}_{Ayg(r)} / \hat{N}_{Ag(r)}$ by removing in turn one PSU from the
analytic survey sample file, adjusting the weights for
the remaining PSUs ($\mathbf{W}_A$ values), and summing the
weighted values for the numerator and denominator
within poststratum $g$. Call the resulting $G \times m_A$
matrix $\hat{\mathbf{Y}}_{AR}$.

6. Calculate the $m_A$ replicate estimates, $\hat{t}_{yP(r)}$, by first
multiplying the elements of $\hat{\mathbf{N}}_{BR}$ and $\hat{\mathbf{Y}}_{AR}$ and summing
down the rows within a column. Next, subtract $\hat{t}_{yP}$
from each of the $m_A$ values and square the terms,
multiply by the PSU-subsampling weight adjustments
specified in (10), and sum across the $m_A$ estimates.
The resulting value is the estimated variance using the
Fuller method, $\mathrm{var}_{ECF2}(\hat{t}_{yP})$.
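
Steps 2 to 4 above can be sketched in a few lines. This is an illustrative reading of those steps rather than the authors' program: it assumes `numpy.linalg.eigh` for the spectral decomposition, orders replicates by stratum, and weights the squared deviations of the replicate controls by $(m_{Ah} - 1)/m_{Ah}$ as in expression (9). The function name and inputs are hypothetical.

```python
import numpy as np

def ecf2_replicate_controls(N_B_hat, V_B_hat, m_h, rng):
    """Replicate controls (G x m_A) and jackknife stratum weights (length m_A)."""
    N_B_hat = np.asarray(N_B_hat, dtype=float)
    m_h = np.asarray(m_h)                           # PSUs per stratum (integers)
    G, m_A = N_B_hat.size, int(m_h.sum())
    if m_A < G:
        raise ValueError("need at least as many PSUs as poststrata")
    lam, Q = np.linalg.eigh(V_B_hat)                # eigenvalues, eigenvectors (columns)
    Z_g = Q * np.sqrt(np.clip(lam, 0.0, None))      # column g is q_g * sqrt(lambda_g)
    Z = np.hstack([Z_g, np.zeros((G, m_A - G))])    # pad with zero adjustments
    Z = Z[:, rng.permutation(m_A)]                  # randomly sort the columns
    c = np.repeat(np.sqrt(m_h / (m_h - 1.0)), m_h)  # c_h for each replicate
    W = np.repeat((m_h - 1.0) / m_h, m_h)           # (m_Ah - 1) / m_Ah per replicate
    N_BR = N_B_hat[:, None] + Z * c                 # Hadamard product via broadcasting
    return N_BR, W

# Hypothetical inputs: G = 2 poststrata, H = 2 strata with 2 and 3 PSUs.
rng = np.random.default_rng(0)
N_B_hat = np.array([30.0, 50.0])
V_B_hat = np.array([[4.0, -1.0], [-1.0, 9.0]])
N_BR, W = ecf2_replicate_controls(N_B_hat, V_B_hat, [2, 3], rng)

# Jackknife variance of the replicate controls.
dev = N_BR - N_B_hat[:, None]
V_jk = (dev * W) @ dev.T
print(V_jk)
```

Since $c_h^2 (m_{Ah} - 1)/m_{Ah} = 1$, the weighted sum of outer products collapses to $\sum_g \hat{\mathbf{z}}_g \hat{\mathbf{z}}_g'$, which is why the replicate controls return the input covariance matrix whichever columns are drawn.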


4.4 Nadimpalli-Judkins-Chu jackknife method 
(ECNJC) 


Nadimpalli et al. (2004) developed a delete-one jackknife
variance estimator that randomly perturbs the control totals
for the complete set of replicates instead of adjusting only a
subsample of replicates as discussed for the ECF2. The
benchmark survey replicate control totals have the following
form:

$$\hat{\mathbf{N}}_{B(r)} = \hat{\mathbf{N}}_B + c_h R_h \hat{\mathbf{S}}_B \boldsymbol{\eta}_{(r)} \qquad (12)$$

where $c_h = \sqrt{m_{Ah}/(m_{Ah} - 1)}$, as with the ECF2;
$R_h = \sqrt{1/(H m_{Ah})}$, a function of the total number of
analytic-survey strata ($H$) and PSUs ($m_{Ah}$); $\hat{\mathbf{S}}_B$ is a
diagonal matrix of estimated standard errors for the
benchmark controls; and $\boldsymbol{\eta}_{(r)}$ is a $G$-length vector of
values randomly generated for each replicate from the
standard normal distribution. The remaining terms are
specified for the ECF2 following expression (8). Note that
the covariance estimates included in the ECF2, i.e., the
off-diagonal values of $\hat{\mathbf{V}}_B$, are set to zero for the ECNJC.
The corresponding delete-one jackknife variance
estimator of the poststratified total is calculated as follows:

$$\begin{aligned}
\mathrm{var}_{ECNJC}(\hat{t}_{yP}) &= \sum_{h=1}^{H} \frac{m_{Ah} - 1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} \left( \hat{t}_{yP(r)} - \hat{t}_{yP} \right)^2 \\
&= \sum_{h=1}^{H} \frac{m_{Ah} - 1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} \left( \hat{t}_{yPS(r)} - \hat{t}_{yP} + c_h R_h \boldsymbol{\eta}_{(r)}' \hat{\mathbf{S}}_B \hat{\bar{\mathbf{Y}}}_{A(r)} \right)^2 \qquad (13)
\end{aligned}$$

where $\hat{t}_{yP(r)}$ is computed as described for the ECF2 in (11)
but with $\hat{N}_{Bg(r)}$ defined by the $g$th component in (12).
Unlike the ECF2, the sample variance of the ECNJC
replicate controls given in (12) reproduces the benchmark
covariance matrix $\hat{\mathbf{V}}_B$ in expectation only if the covariance
terms are truly zero (see Appendix A for details). If $\hat{\mathbf{V}}_B$ is
not diagonal, $\mathrm{var}_{ECNJC}$ fails this test.
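
A short Monte Carlo sketch can make this point concrete. It assumes replicate controls of the form $\hat{\mathbf{N}}_B + c_h R_h \hat{\mathbf{S}}_B \boldsymbol{\eta}_{(r)}$ with $c_h = \sqrt{m_{Ah}/(m_{Ah}-1)}$ and $R_h = \sqrt{1/(H m_{Ah})}$, and averages the jackknife covariance of the replicate controls over repeated draws; the function names, seed, and inputs are hypothetical.

```python
import numpy as np

def ecnjc_replicate_controls(N_B_hat, se_B, m_h, rng):
    """Replicate controls perturbed independently in each poststratum."""
    N_B_hat = np.asarray(N_B_hat, dtype=float)
    se_B = np.asarray(se_B, dtype=float)
    m_h = np.asarray(m_h)
    H, m_A = m_h.size, int(m_h.sum())
    c = np.repeat(np.sqrt(m_h / (m_h - 1.0)), m_h)   # c_h per replicate
    R = np.repeat(np.sqrt(1.0 / (H * m_h)), m_h)     # R_h per replicate
    eta = rng.standard_normal((N_B_hat.size, m_A))   # standard normal draws
    return N_B_hat[:, None] + (c * R) * (se_B[:, None] * eta)

def jackknife_cov(N_BR, N_B_hat, m_h):
    """Delete-one jackknife covariance of the replicate controls."""
    m_h = np.asarray(m_h)
    W = np.repeat((m_h - 1.0) / m_h, m_h)
    dev = N_BR - np.asarray(N_B_hat, dtype=float)[:, None]
    return (dev * W) @ dev.T

# Hypothetical benchmark estimates with a negative covariance between controls.
rng = np.random.default_rng(2010)
N_B_hat = np.array([30.0, 50.0])
V_B_hat = np.array([[4.0, -1.0], [-1.0, 9.0]])
se_B = np.sqrt(np.diag(V_B_hat))
m_h = [2, 3]

draws = [jackknife_cov(ecnjc_replicate_controls(N_B_hat, se_B, m_h, rng), N_B_hat, m_h)
         for _ in range(5000)]
avg = np.mean(draws, axis=0)
print(avg)
```

Up to simulation error, the average matches the diagonal of the benchmark covariance matrix, while the off-diagonal entries average to about zero instead of the assumed covariance.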

Use of the ECNJC would be plausible in two cases: (i) 
the complete benchmark covariance matrix for the controls 
is unavailable (e.g., estimates taken from a previous report), 
or (ii) the covariance terms are negative so that the resulting 
values defined by (12) would lead to conservative variance 
estimates. The diagonal matrix $\hat{S}_B$ would be correct if 
the estimated poststratum counts were actually uncorrelated. 
However, this is unlikely because of the multinomial 
structure of $\hat{N}_B$. Given the setup for the ECNJC, the 
expectation of the variance estimator will not approximate 
$AV(\hat{t}_{yP})$ in (5); the bias term is related to the difference 
between the design expectation of $\hat{S}_B^2$ and $V_B$. 
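
The behaviour described for the ECNJC can be checked numerically. The following is a minimal sketch, assuming the constants $c_h = \sqrt{m_{Ah}/(m_{Ah}-1)}$ and $R_h = \sqrt{1/(H m_{Ah})}$ as read from (12); the function names and the toy standard errors are hypothetical and are not taken from the paper's simulation code. Averaged over many random perturbations, the jackknife covariance of the replicate controls should approach the squared standard errors on the diagonal and zero off the diagonal, whatever the true covariances are.

```python
import math
import random


def ecnjc_jackknife_cov(std_errs, m_by_stratum, rng):
    """Jackknife covariance of ECNJC replicate controls built as in (12).

    std_errs     : estimated standard errors of the G benchmark controls
    m_by_stratum : number of analytic-survey PSUs in each stratum
    """
    H = len(m_by_stratum)
    G = len(std_errs)
    cov = [[0.0] * G for _ in range(G)]
    for m in m_by_stratum:
        c_h = math.sqrt(m / (m - 1))      # c_h as read from (12)
        r_h = math.sqrt(1.0 / (H * m))    # R_h as read from (12)
        for _ in range(m):                # one replicate per PSU in the stratum
            eta = [rng.gauss(0.0, 1.0) for _ in range(G)]
            # deviation of the replicate control vector from the benchmark estimate
            dev = [c_h * r_h * s * e for s, e in zip(std_errs, eta)]
            for a in range(G):
                for b in range(G):
                    cov[a][b] += (m - 1) / m * dev[a] * dev[b]
    return cov


def mean_cov(n_sims, std_errs, m_by_stratum, seed=1):
    """Average the jackknife covariance over repeated random perturbations."""
    rng = random.Random(seed)
    G = len(std_errs)
    total = [[0.0] * G for _ in range(G)]
    for _ in range(n_sims):
        cov = ecnjc_jackknife_cov(std_errs, m_by_stratum, rng)
        for a in range(G):
            for b in range(G):
                total[a][b] += cov[a][b] / n_sims
    return total
```

With 25 strata of two PSUs each and standard errors 3 and 5, `mean_cov(2000, [3.0, 5.0], [2] * 25)` should return a matrix close to diag(9, 25); the off-diagonal entry stays near zero by construction.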


4.5 Multivariate normal jackknife method (ECMV) 


The multivariate normal method (ECMV) is a 
generalization of the ECNJC and to our knowledge is first 
discussed in this paper. The ECMV uses the complete 
covariance matrix $\hat{V}_B$ and relies on large-sample theory so 
that the control total adjustments may be modeled as 
coming from a $G$-dimensional multivariate normal (MVN) 
distribution. The replicate controls for the ECMV have the 
form 

$$\hat{N}_{B(r)} = \hat{N}_{B} + c_h R_h \varepsilon_{(r)} \qquad (14)$$

where $\varepsilon_{(r)}$ is a $G$-length vector of random variables such 
that $\varepsilon_{(r)} \sim MVN_G(0, \hat{V}_B)$; $c_h = \sqrt{m_{Ah}/(m_{Ah}-1)}$; and 
$R_h = \sqrt{1/(H m_{Ah})}$. 

The delete-one jackknife variance estimator for the 
ECMV is calculated as 


$$\mathrm{var}_{ECMV}(\hat{t}_{yP}) = \sum_{h=1}^{H} \frac{m_{Ah}-1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} \left(\hat{t}_{yP(r)} - \hat{t}_{yP}\right)^2 \qquad (15)$$

where $\hat{t}_{yP(r)}$ is computed as described for the ECF2 in (11) 
but with $\hat{N}_{Bg(r)}$ defined by the $g$th component in (14). 
Unlike the Fuller method, $\mathrm{var}_{ECMV}(\hat{N}_B) \neq \hat{V}_B$; instead, the 
ECMV must rely on the design-based properties of the 
estimator. The design expectation of this estimator is 
evaluated with respect to the MVN distribution conditioned 
on the benchmark estimates ($E_\varepsilon$), and then with respect to 
the benchmark survey design ($E_B$). As shown in 
Appendix B.1, 


$$E_B\left[E_\varepsilon\left(\mathrm{var}_{ECMV}(\hat{N}_B) \mid B\right)\right] = E_B(\hat{V}_B). \qquad (16)$$

If $\hat{V}_B$ is an approximately unbiased estimator of $V_B$, 
then the population covariance matrix is reproduced with 
this method. 

Under the Fuller two-phase method, $\mathrm{Var}[\mathrm{var}_{ECF2}(\hat{N}_B)] = \mathrm{Var}(\hat{V}_B)$ 
because $\mathrm{var}_{ECF2}(\hat{N}_B) = \hat{V}_B$. To compare ECF2 
and ECMV further, note that if we define $y_k = 1$ in the 
analytic survey, then $\hat{t}_{yP} = 1'\hat{N}_B$. As shown in 
Appendix B.2, 

$$\mathrm{Var}\left[\mathrm{var}_{ECMV}(1'\hat{N}_B)\right] = \mathrm{Var}_B\left[1'\hat{V}_B 1\right] + \frac{2}{H\tilde{m}_A} E_B\left[(1'\hat{V}_B 1)^2\right] \geq \mathrm{Var}_B\left[1'\hat{V}_B 1\right] \qquad (17)$$

where $\tilde{m}_A$ is the harmonic mean of the PSU sample sizes 
per stratum in the analytic survey. This suggests that 
$\mathrm{var}_{ECF2}$ and $\mathrm{var}_{ECMV}$ have similar large sample 
expectations, though in practice the ECMV is likely to be 
more variable than the ECF2. We examine this issue 
through a simulation study described in the next section. 
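
Both properties claimed for the ECMV can be illustrated with a small Monte Carlo check conditional on a fixed benchmark covariance estimate. This is only a sketch under stated assumptions: the scaling constants are those read from (14), the 2 x 2 covariance matrix is a made-up example, and all function names are hypothetical. Conditional on $\hat{V}_B$, the mean of the jackknife covariance should be close to $\hat{V}_B$ itself, and the variance of the estimate for $1'\hat{N}_B$ should be close to $2(1'\hat{V}_B 1)^2/(H\tilde{m}_A)$, the extra term in (17).

```python
import math
import random


def cholesky(v):
    """Lower-triangular L with L L' = v for a small positive-definite matrix."""
    G = len(v)
    L = [[0.0] * G for _ in range(G)]
    for i in range(G):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(v[i][i] - s)
            else:
                L[i][j] = (v[i][j] - s) / L[j][j]
    return L


def ecmv_jackknife_cov(v_hat, m_by_stratum, rng):
    """Jackknife covariance of ECMV replicate controls built as in (14)."""
    H, G = len(m_by_stratum), len(v_hat)
    L = cholesky(v_hat)
    cov = [[0.0] * G for _ in range(G)]
    for m in m_by_stratum:
        # (m - 1)/m * c_h^2 * R_h^2 simplifies to 1/(H * m)
        scale = 1.0 / (H * m)
        for _ in range(m):
            z = [rng.gauss(0.0, 1.0) for _ in range(G)]
            # eps ~ MVN(0, v_hat), generated through the Cholesky factor
            eps = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(G)]
            for a in range(G):
                for b in range(G):
                    cov[a][b] += scale * eps[a] * eps[b]
    return cov


def simulate(v_hat, m_by_stratum, n_sims, seed=1):
    """Return (mean jackknife covariance, variance of the estimate for 1'N_B)."""
    rng = random.Random(seed)
    G = len(v_hat)
    mean = [[0.0] * G for _ in range(G)]
    totals = []
    for _ in range(n_sims):
        cov = ecmv_jackknife_cov(v_hat, m_by_stratum, rng)
        totals.append(sum(sum(row) for row in cov))   # 1' cov 1
        for a in range(G):
            for b in range(G):
                mean[a][b] += cov[a][b] / n_sims
    t_bar = sum(totals) / n_sims
    var_t = sum((t - t_bar) ** 2 for t in totals) / (n_sims - 1)
    return mean, var_t
```

For example, with `v_hat = [[4.0, 1.5], [1.5, 9.0]]` and 25 strata of two PSUs, $1'\hat{V}_B 1 = 16$ and the conditional variance implied by (17) is $2 \times 16^2 / 50 = 10.24$.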


5. Description of simulation study 


We complement the theoretical evaluation of the five 
variance estimators discussed in the previous section with 
an analysis of simulation results. 


5.1 Simulation parameters 


The simulation population is a random subset of the 2003 
National Health Interview Survey (NHIS) public-use file 
containing records for 21,664 adults. These records were 
divided into 25 strata, each containing six PSUs. Samples 
were selected from this “population” using a two-stage 
design. Two PSUs were selected with replacement using 
probabilities proportional to the total number of adults (PPS) 
within the PSU. From within each sample PSU, we selected 
simple random samples of ($n_{Ahi} =$) 20 and 40 persons 
without replacement giving total sample sizes of 1,000 and 
2,000, respectively. Two within-PSU sample sizes were 
considered for this study to evaluate the effects of smaller 
analytic survey variance components, calculated by 
increasing $n_{Ahi}$, on the variance of $\hat{t}_{yP}$. For each 
combination of PSU and person-level samples (i.e., 50 
PSUs and either 1,000 or 2,000 persons), we selected 4,000 
simulation samples. We calculated the estimated population 
totals and associated variances for two binary NHIS 
variables: NOTCOV = 1 indicates that an adult did not have 
health insurance coverage in the 12 months prior to the 
NHIS interview (approximately 17 percent of the 
population); and PDMED12M = 1 indicates that an adult 
delayed medical care because of cost in the 12 months prior 
to the interview (approximately 7 percent of the population). 
We exclude nonresponse from consideration in our current 
simulation study to minimize factors that might affect our 
comparisons. (Note: The interview questions for these 
variables can be found in the family core instrument at 
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Survey_Questionnaires/NHIS/2003/qfamilyx.pdf. 
Responses from questions FHI.070 and FAU.010/FAU.020 were used to 
generate the variables NOTCOV and PDMED12M, 
respectively.) 

Poststratification may reduce variances slightly. How- 
ever, in household surveys, this technique is mainly used to 
correct for sampling frame undercoverage, as well as other 
problems inherent with surveys. Each of the 4,000 
simulation samples was selected to mimic a sampling frame 
for the analytic survey that suffers from differential 
undercoverage, such as those used for many telephone 
surveys. Sixteen (G =16) poststratification cells were 
defined by an eight-level age variable crossed with gender. 
The coverage rates for the 16 cells were created based on 
the population means for each age group by gender and 
range in value from 0.5 to 0.9. A coverage rate equal to 1.0 
would indicate full coverage. Before each sample was 
selected, the frame was designated as a stratified random 
subsample of the full population of 21,664. For example, 90 
percent of the male population 65-69 years of age was 
randomly selected to be in the sampling frame for the 
NOTCOV simulations. This process of subsetting the 
population to the frame was independently implemented for 
each sample and for each outcome variable. 

We suspect that the decision for researchers to use either 
a traditional or an EC poststratification variance estimator 
depends on the precision of the control totals. We calculated 
the benchmark covariance matrix ($\hat{V}_B$) from the complete 
NHIS public-use data file (92,148 records) and ratio 
adjusted the values to reflect a sample size comparable with 
our simulation population ($N = 21,664$). The off-diagonal 
values of $\hat{V}_B$ range from -0.05 to 0.75 with a mean value of 
0.22. From this matrix we calculated four covariance 
matrices for the simulation by dividing the original matrix 
by the adjustment factors 1.0, 3.6, 18, and 72. The 
adjustments reflect benchmark surveys with an approximate 
effective sample size of 21,700, 6,000 (~ 21,700/3.6), 1,200, 
and less than 500, respectively. 
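
The arithmetic behind the four benchmark scenarios is simple enough to sketch. The helper names below are hypothetical, and the base size of 21,700 is the rounded figure quoted in the text; the snippet applies each adjustment factor to a covariance matrix and derives the approximate effective sample size that goes with it.

```python
def scale_benchmark(v_b, factor):
    """Apply one adjustment factor to every entry of the covariance matrix,
    following the description in the text (hypothetical helper)."""
    return [[value / factor for value in row] for row in v_b]


def effective_sizes(base_n, factors):
    """Approximate effective benchmark sample size implied by each factor."""
    return {factor: base_n / factor for factor in factors}


sizes = effective_sizes(21700, [1.0, 3.6, 18, 72])
for factor, n_eff in sizes.items():
    print(factor, round(n_eff))
```

The factors 3.6, 18 and 72 give roughly 6,000, 1,200 and 300 respectively, which matches the sizes quoted above (the last being "less than 500").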

The simulation was conducted in R™ (Lumley 2009; R 
Development Core Team 2009) because of its extensive 
capabilities for analyzing survey data and efficiency with 
simulated analyses. Code was developed to calculate the 
linearization and replicate variance estimates for the EC 
poststratified estimator discussed above because the relevant 
code does not currently exist. 

5.2 Evaluation criteria 


The empirical results for the five variance estimators 
discussed in the previous section (Naive, ECTS, ECF2, 
ECNJC, and ECMV) are compared using three measures 
across the $j = 1, \ldots, 4{,}000$ simulation samples, and the two 
outcome variables (NOTCOV and PDMED12M). The 
measures include: (i) the estimated percent relative bias of 
the variance estimator, $\left(\frac{1}{4{,}000}\sum_j \mathrm{var}(\hat{t}_{yPj}) - mse\right)/mse$, 
where $\mathrm{var}(\hat{t}_{yPj})$ is one of the five variance estimates 
evaluated for sample $j$ and $mse$ is the mean square error of 
$\hat{t}_{yP}$ defined below; (ii) the 95% confidence interval 
coverage rate, $\frac{1}{4{,}000}\sum_j I(|z_j| \leq z_{1-\alpha/2})$ where 
$z_j = (\hat{t}_{yPj} - t_y)/\sqrt{\mathrm{var}(\hat{t}_{yPj})}$; and (iii) the standard deviation of 
the estimated standard errors, calculated as the square root 
of $\frac{1}{4{,}000-1}\sum_j \left(\sqrt{\mathrm{var}(\hat{t}_{yPj})} - \frac{1}{4{,}000}\sum_j \sqrt{\mathrm{var}(\hat{t}_{yPj})}\right)^2$. 
The relative bias and the root mean square error of our 
point estimators are calculated as $\frac{1}{4{,}000}\sum_j (\hat{t}_{yPj} - t_y)/t_y$ 
and $\sqrt{mse} = \sqrt{\frac{1}{4{,}000}\sum_j (\hat{t}_{yPj} - t_y)^2}$, respectively. 
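
The three evaluation measures can be written compactly. The sketch below is illustrative only: function names are hypothetical, the relative bias is multiplied by 100 on the assumption that "percent" is meant literally, and the critical value 1.96 is assumed for the 95 percent interval.

```python
import math


def mse(estimates, truth):
    """Mean square error of the point estimates across simulation samples."""
    return sum((t - truth) ** 2 for t in estimates) / len(estimates)


def percent_relative_bias(variances, estimates, truth):
    """Measure (i): mean variance estimate relative to the empirical mse."""
    m = mse(estimates, truth)
    return 100.0 * (sum(variances) / len(variances) - m) / m


def coverage_rate(variances, estimates, truth, z_crit=1.96):
    """Measure (ii): share of samples whose interval covers the true total."""
    hits = sum(1 for v, t in zip(variances, estimates)
               if abs(t - truth) / math.sqrt(v) <= z_crit)
    return hits / len(estimates)


def sd_of_standard_errors(variances):
    """Measure (iii): standard deviation of the estimated standard errors."""
    ses = [math.sqrt(v) for v in variances]
    mean_se = sum(ses) / len(ses)
    return math.sqrt(sum((s - mean_se) ** 2 for s in ses) / (len(ses) - 1))
```

As a toy usage example, four simulated estimates `[9, 11, 10, 10]` of a true total 10 with variance estimates `[0.25, 0.25, 1.0, 1.0]` give an mse of 0.5, a percent relative bias of 25, and a coverage rate of 0.5.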


6. Simulation study results 


6.1 Point estimator 


To justify the need for poststratification, we initially 
evaluated the Horvitz-Thompson (HT) estimate ($\sum_k d_k y_k$) for 
the two outcome variables. This estimator is known to be 
design-unbiased under pristine conditions. The percent 
relative bias indicates that the HT estimator is negatively 
biased, underestimating the population total by 38 percent 
for NOTCOV and 41 percent for PDMED12M. These large 
values show that some correction is needed to adjust for the 
non-negligible levels of bias. The percent relative bias for 
the poststratified estimator $\hat{t}_{yP}$ was much lower; $\hat{t}_{yP}$ is 
positively biased by no more than two percent for both 
outcome variables. 


6.2 Variance estimators 


Adding to the theoretical evaluation discussed in Section 
4, the empirical results for an effective variance estimator 
should possess a percent relative bias either near zero or 
somewhat positive for a conservative measure (see Section 
5.2 for the formula of the percent relative bias). 

The percent relative biases generated from our simulation 
study are provided in Table 1. Bias estimates for the Naive 
and ECNJC variance estimators are larger than for the other 
EC estimators for all our simulations. Estimates for the 
ECTS are somewhat smaller than the values calculated for 
the ECF2 and ECMV estimators for relatively small 
benchmark surveys. However, the differences are negligible 
as the size of the benchmark survey increases. 

Table 1 
Percent relative bias estimates for five variance estimators by outcome variable and relative size of the benchmark survey to the analytic survey 

                            Relative Size (n_A = 1,000)        Relative Size (n_A = 2,000) 
Outcome      Variance 
Variable     Estimator      0.3     1.2     6.0    21.7        0.2     0.6     3.0    10.8 
NOTCOV       Naive        -50.3   -23     -10.7    -9.2      -56.0   -31     -14.2   -12.2 
             ECTS          -4.5    -4.5    -6.1    -7.7       -0.2    -8.4    -8.2   -10.1 
             ECF2          -4.7    -4.6    -5.8    -7.5        0.1    -8.2    -8.3   -10.1 
             ECNJC        -36.7   -17.1    -8.9    -8.2      -40     -24.2   -11.9   -11.1 
             ECMV          -4.3    -4.1    -6.0    -7.5       -0.2    -8.1    -8.1   -10.0 
PDMED12M     Naive        -34.4   -14.5    -5.7    -3.9      -48.1   -23.4   -10     -10.1 
             ECTS          -3.3    -3.7    -2.7    -2.6       -4.7    -6.4    -5.1    -7.8 
             ECF2          -3.5    -3.5    -2.4    -2.3       -4.6    -6.8    -5.2    -7.8 
             ECNJC        -24.5   -10.5    -4.0    -2.7      -35.1   -17.6    -7.6    -8.4 
             ECMV          -3.0    -3.3    -2.4    -2.2       -4.3    -6.3    -5.0    -7.7 


The traditional poststratified estimator (Naive) was most 
negatively biased among those compared as expected. 
When the benchmark survey is smaller than the analytic 
survey (and therefore produces estimates less precise than 
the analytic survey), the Naive estimator is negatively 
biased by as much as 56 percent. The level of bias improved 
as the relative size of the benchmark survey increased; 
however, the Naive estimator still resulted in, at best, a four 
percent underestimate. The ECNJC estimator fared slightly 
better than the Naive estimator though the bias (-2.7 to -40 
percent) is still larger than the other EC variance estimators, 
which range between -10.1 and 0.1 percent. 

For a small benchmark survey relative to the size of the 
analytic survey (i.e., relative size less than one), the levels of 
(absolute) bias dramatically increased for the Naive and 
ECNJC estimators. The opposite effect is noted for the other 
EC variance estimators. The variance component associated 
with the benchmark survey, e.g., $\hat{\bar{Y}}_A'\hat{V}_B\hat{\bar{Y}}_A$ shown for 
$\mathrm{var}_{ECTS}$ in (7), becomes the dominant term within the EC 
variance estimators as the precision of the benchmark 
survey estimates decreases. Thus the benchmark variance 
component somewhat corrects for the underestimation 
associated with the analytic variance component. Additional 
research is needed to determine if a threshold exists for 
when such a counterbalance of bias can occur. The overall 
negative bias of our estimates is similar to the bias of 
linearization variance estimators as shown in another 
context by Rao and Wu (1985, section 4) and Wu (1985). 
However, further research is also needed to determine how 
to minimize the underestimation. 

Note that the relative sizes of 21.7 when $n_A = 1,000$ and 
10.8 when $n_A = 2,000$ both imply benchmark survey 
sample sizes of about 21,600. Thus the $O(M^2/m_B)$ 
component of the variance, $\hat{\bar{Y}}_A'\hat{V}_B\hat{\bar{Y}}_A$, is more prominent 
for the estimates in Table 1 based on $n_A = 2,000$. This leads 
to larger relative biases in these estimates, relative to those 
produced under $n_A = 1,000$, even though the analytic 
survey sample size is larger. 


The patterns exhibited for the percent relative bias are 
reflected in the coverage rates for the 95 percent confidence 
intervals for the estimated totals but are not provided for 
sake of brevity. The Naive and ECNJC estimators are more 
likely to experience confidence intervals coverage rates 
below 95 percent. These rates approach the appropriate 
level as the precision of the benchmark survey estimates 
improves. However, the remaining EC variance estimators 
had coverage rates near acceptable levels regardless of the 
relative size of the surveys and therefore are more robust. 

The discussion so far suggests that there are minimal 
theoretical, as well as empirical, differences between the 
ECTS, ECF2, and ECMV methods. We finally look to the 
standard deviation of the estimated standard errors (SEs) in 
an attempt to distinguish the estimators. An examination of 
this variability can provide insight on the (empirical) 
stability of the variance estimators, i.e., an unstable variance 
estimator could generate a poor variance estimate based on 
the nuances of a particular sample. Table 2 contains the 
percent relative increase in the standard deviations for the 
ECF2 and the ECMV both in comparison to the ECTS. 

The variation in the ECMV variance estimates was 
noticeably larger than for ECF2 but only for relatively small 
benchmark surveys. The difference increased as the size of 
the analytic survey increased. This suggests that the ECF2 
may be preferred over the ECMV due to increased stability 
in the variance estimates. However, further research is being 
conducted on the threshold for when the instability can 
affect the estimates. 


7. Conclusions and future work 


The theoretical and analytical work discussed in this paper 
supports the need for a new methodology to address 
poststratification using estimated control totals, i.e., 
estimated-control (EC) poststratification. Traditional variance estimators 
can severely underestimate the population sampling variance 
resulting in, for example, incorrect decisions for hypothesis 
tests and sub-optimal sample allocations when the design is 
implemented in the future. 

Table 2 
Percent increase in instability of variance estimates relative to the ECTS by outcome variable and relative size of the benchmark survey 

                            Relative Size (n_A = 1,000)        Relative Size (n_A = 2,000) 
Outcome      Variance 
Variable     Estimator      0.3     1.2     6.0    21.7        0.2     0.6     3.0    10.8 
NOTCOV       ECF2          12.0    [?]     [?]     0.2       15.1     8.4    [?]     0.6 
             ECMV          [?]     7.4     1.8     0.3       30.8     8.5     2.4    [?] 
PDMED12M     ECF2          [?]     3.8     [?]     0.4       12.0     6.3    [?]     0.7 
             ECMV          [?]     4.0     0.9     0.5       22.6     7.6    [?]     1.1 

The EC linearization variance estimator $\mathrm{var}_{ECTS}$ in 
expression (7) shows promise for EC poststratification. This 
estimator is especially effective at reducing the percent 
relative bias experienced with the Naive variance estimator in 
(6) when the benchmark survey is small relative to the 
analytic survey. The replication variance estimator $\mathrm{var}_{ECF2}$ 
given in (9) is recommended specifically for studies requiring 
replicate weights such as when public-use analysis files are 
released without sampling design information to further 
protect data confidentiality and respondent privacy. The 
alternative replication estimator $\mathrm{var}_{ECMV}$ also performed well 
and is somewhat easier to implement than $\mathrm{var}_{ECF2}$. 

Implementation of the recommended variance estimators 
requires specialized computer programs because the 
capabilities are currently not available in standard software. 
The linearization estimator may be more approachable 
because implementation involves a modification to available 
variance estimates, e.g., $\mathrm{var}_{ECTS}(\hat{t}_{yECPS}) = \mathrm{var}_{Naive}(\hat{t}_{yECPS}) + \hat{\bar{Y}}_A'\hat{V}_B\hat{\bar{Y}}_A$. 
We provide a step-by-step discussion of the 
procedures required for the $\mathrm{var}_{ECF2}$ (see Section 4.3) to 
facilitate the creation of the computer program. 

Extensions to this research to be presented at a later date 
include a generalization to linear calibration, to other 
statistics including a ratio-estimated mean, and to domain 
estimation. We additionally are investigating whether 
threshold values are identifiable which determine (i) when 
there are negligible differences between traditional and EC 
variance estimation, and (ii) when the benchmark controls 
are too imprecise to use for calibration. We also plan to 
investigate the theoretical implications of measurement 
errors in the analytic as well as the benchmark surveys. 

Appendix A 
Derivation of $\mathrm{var}_{ECNJC}(\hat{N}_B)$ 


For the following derivations, let $E_\eta$ represent the 
expectation with respect to a standard normal distribution. 
All other terms are defined in the body of the paper. 

$$\mathrm{var}_{ECNJC}(\hat{N}_B) = \sum_{h=1}^{H} \frac{m_{Ah}-1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} (\hat{N}_{B(r)} - \hat{N}_B)(\hat{N}_{B(r)} - \hat{N}_B)' = \frac{1}{H}\sum_{h=1}^{H} \frac{1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} \hat{S}_B K_{(r)} \hat{S}_B$$

where $K_{(r)} = \eta_{(r)}\eta_{(r)}'$, a $G \times G$ cross-product matrix of 
standard normal values; and $\hat{S}_B^2 = \mathrm{diag}(\hat{V}_B)$. Because 
$E_\eta(K_{(r)}) = I_G$, a $G$-dimension identity matrix, we have 
$E_\eta[\mathrm{var}_{ECNJC}(\hat{N}_B)] = \mathrm{diag}(\hat{V}_B)$. Therefore, $\mathrm{var}_{ECNJC}(\hat{N}_B)$ 
does not reproduce $\hat{V}_B$ in expectation. 


Appendix B 
Evaluation of the ECMV 


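For the following derivations, let $E_B$ and $\mathrm{Var}_B$ represent 
the expectation and variance with respect to the benchmark 
survey sampling design. Also, let $E_\varepsilon$ and $\mathrm{Var}_\varepsilon$ represent the 
expectation and variance with respect to the $G$-dimensional 
multivariate normal distribution, $MVN_G(0, \hat{V}_B)$. All other 
terms are defined in the body of the paper. 


B.1: Derivation of $E[\mathrm{var}_{ECMV}(\hat{N}_B)]$ given in (15) 
Using expression (14) and $c_h^2 = m_{Ah}/(m_{Ah}-1)$, 

$$E[\mathrm{var}_{ECMV}(\hat{N}_B)] = E_B\left[E_\varepsilon\left(\sum_{h=1}^{H} \frac{m_{Ah}-1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} c_h^2 R_h^2\, \varepsilon_{(r)}\varepsilon_{(r)}' \,\Big|\, B\right)\right] = E_B\left[\frac{1}{H}\sum_{h=1}^{H} \frac{1}{m_{Ah}} \sum_{r=1}^{m_{Ah}} E_\varepsilon\left(\varepsilon_{(r)}\varepsilon_{(r)}' \mid B\right)\right] = E_B(\hat{V}_B).$$
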
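
B.2: Derivation of $\mathrm{Var}[\mathrm{var}_{ECMV}(\hat{N}_B)]$ given in (15) 


When $y_k = 1$ so that $\hat{t}_{yP} = 1'\hat{N}_B$, $\mathrm{var}_{ECMV}(1'\hat{N}_B) = H^{-1}\sum_{h} m_{Ah}^{-1} \sum_{r} 1'\varepsilon_{(r)}\varepsilon_{(r)}'1$. 
Using the formula for the 
variance of a quadratic form (Searle 1982, section 13.5), we 
have 

$$\mathrm{Var}\left[\mathrm{var}_{ECMV}(1'\hat{N}_B)\right] = \mathrm{Var}_B\left[E_\varepsilon\left(\mathrm{var}_{ECMV}(1'\hat{N}_B) \mid B\right)\right] + E_B\left[\mathrm{Var}_\varepsilon\left(\mathrm{var}_{ECMV}(1'\hat{N}_B) \mid B\right)\right]$$

$$= \mathrm{Var}_B\left[1'\hat{V}_B 1\right] + E_B\left[\frac{2}{H^2}\sum_{h=1}^{H} \frac{1}{m_{Ah}} (1'\hat{V}_B 1)^2\right]$$

$$= \mathrm{Var}_B\left[1'\hat{V}_B 1\right] + \frac{2}{H\tilde{m}_A} E_B\left[(1'\hat{V}_B 1)^2\right],$$

where $\tilde{m}_A = \left(H^{-1}\sum_{h=1}^{H} 1/m_{Ah}\right)^{-1}$ is the harmonic mean 
of $m_{Ah}$. 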


Some contributions to jackknifing two-phase sampling estimators 


Patrick J. Farrell and Sarjinder Singh ' 


Abstract 


In this paper, the problem of estimating the variance of various estimators of the population mean in two-phase sampling has 
been considered by jackknifing the two-phase calibrated weights of Hidiroglou and Sarndal (1995, 1998). Several estimators 
of population mean available in the literature are shown to be the special cases of the technique developed here, including 
those suggested by Rao and Sitter (1995) and Sitter (1997). By following Raj (1965) and Srivenkataramana and Tracy 
(1989), some new estimators of the population mean are introduced and their variances are estimated through the proposed 
jackknife procedure. The variance of the chain ratio and regression type estimators due to Chand (1975) are also estimated 
using the jackknife. A simulation study is conducted to assess the efficiency of the proposed jackknife estimators relative to 


the usual estimators of variance. 


Key Words: Auxiliary information; Calibration; Estimation of mean and variance; Jackknife; Two-phase sampling. 


1. Introduction 


Hidiroglou and Sarndal (1995, 1998) have pointed out 
that two-phase sampling for the estimation of finite population 
attributes is a powerful and cost-effective technique, 
and hence plays an eminent role in survey sampling. 
Two-phase sampling can be described as follows. Consider a 
finite population that we shall denote by $\Omega = \{1, 2, \ldots, i, \ldots, N\}$. 
Suppose that information is available on a 
variable $Z$ across the entire population; that is, the values 
$z_i$ for all $i = 1, \ldots, N$, are known, implying that the population 
mean, $\bar{Z}$, is also known. A first-phase probability 
sample $s_1$, $s_1 \subset \Omega$, of size $m$ is drawn from the population 
with selection probabilities $\pi_{1i}$. Thus, the first-phase 
sampling weights can be defined as $d_{1i} = 1/\pi_{1i}$. Assume 
that for this sample, information is collected on a variable 
$X$, which is then paired with the information on $Z$ for each 
of the $m$ units, giving rise to the data $\{(x_i, z_i) \mid i \in s_1\}$ for 
$i = 1, \ldots, m$. Once the first-phase sample $s_1$ has been 
drawn, a second-phase sample $s_2$, $s_2 \subset s_1 \subset \Omega$, of size $n$ 
is selected from $s_1$ with selection probabilities $\pi_{2i} = \pi_{i|s_1}$, 
allowing for the second-phase sampling weights to be 
defined as $d_{2i} = 1/\pi_{2i}$. In the second-phase sample, 
information is now collected on a variable $Y$ for each 
selected unit. This information is linked to that previously 
available on $Z$ and $X$ for these units, giving rise to the 
data $\{(x_i, y_i, z_i) \mid i \in s_2\}$ for $i = 1, \ldots, n$. Suppose that 
interest lies in estimating the population mean $\bar{Y}$, and 
the variance of the estimator employed. 

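Let $w_{1i}^{\circ} = d_{1i}/\sum_{i \in s_1} d_{1i}$ denote the first-phase normalized 
original design weights. The usual estimator of the 
population mean $\bar{X}$ is given by 

$$\bar{X}_1^{\circ} = \sum_{i \in s_1} w_{1i}^{\circ} x_i$$

while a calibrated first-phase estimator of $\bar{X}$ is 

$$\bar{X}_1^{c} = \sum_{i \in s_1} w_{1i}^{c} x_i$$

where the $w_{1i}^{c}$ are calibrated weights such that the 
chi-square distance function 

$$D_1 = \sum_{i \in s_1} (w_{1i}^{c} - w_{1i}^{\circ})^2 / (w_{1i}^{\circ} q_{1i}) \qquad (1.1)$$

is minimized subject to 

$$\sum_{i \in s_1} w_{1i}^{c} z_i = \bar{Z}. \qquad (1.2)$$

In (1.1), the $q_{1i}$ are a set of suitably chosen weights. 
Minimization of (1.1) subject to (1.2) leads to the first-phase 
calibrated weights 

$$w_{1i}^{c} = w_{1i}^{\circ} + \frac{w_{1i}^{\circ} q_{1i} z_i}{\sum_{i \in s_1} w_{1i}^{\circ} q_{1i} z_i^2}\left(\bar{Z} - \sum_{i \in s_1} w_{1i}^{\circ} z_i\right).$$

Thus, a first-phase calibrated estimator of $\bar{X}$ is given by 

$$\bar{X}_1^{c} = \bar{X}_1^{\circ} + \hat{\beta}_1(\bar{Z} - \bar{Z}_1^{\circ})$$

where 

$$\hat{\beta}_1 = \left(\sum_{i \in s_1} w_{1i}^{\circ} q_{1i} x_i z_i\right)\Big/\left(\sum_{i \in s_1} w_{1i}^{\circ} q_{1i} z_i^2\right).$$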


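Now, let $w_{2i}^{\circ} = d_{1i}d_{2i}/\sum_{i \in s_2} d_{1i}d_{2i}$ denote the 
second-phase normalized design weights. The usual estimator of $\bar{Y}$ 
is given by 

$$\bar{Y}_2^{\circ} = \sum_{i \in s_2} w_{2i}^{\circ} y_i.$$

Let us consider the second-phase calibrated estimator of $\bar{Y}$ 
as 

$$\bar{Y}^{c} = \sum_{i \in s_2} w_{2i}^{c} y_i \qquad (1.3)$$

where the $w_{2i}^{c}$ are the second-phase calibrated weights such 
that the chi-square distance function 

$$D_2 = \sum_{i \in s_2} (w_{2i}^{c} - w_{2i}^{\circ})^2 / (w_{2i}^{\circ} q_{2i}) \qquad (1.4)$$

is minimized subject to the calibration constraint 

$$\sum_{i \in s_2} w_{2i}^{c} x_i = \bar{X}_1^{c}. \qquad (1.5)$$

Minimization of (1.4) subject to (1.5) leads to the 
second-phase calibrated weights 

$$w_{2i}^{c} = w_{2i}^{\circ} + \frac{w_{2i}^{\circ} q_{2i} x_i}{\sum_{i \in s_2} w_{2i}^{\circ} q_{2i} x_i^2}\left(\bar{X}_1^{c} - \sum_{i \in s_2} w_{2i}^{\circ} x_i\right).$$

Thus, the second-phase calibrated estimator of $\bar{Y}$ specified 
in (1.3) can be written as 

$$\bar{Y}^{c} = \bar{Y}_2^{\circ} + \hat{\beta}_2\left[\bar{X}_1^{\circ} - \bar{X}_2^{\circ} + \hat{\beta}_1(\bar{Z} - \bar{Z}_1^{\circ})\right] \qquad (1.6)$$

where $\bar{Z}_1^{\circ} = \sum_{i \in s_1} w_{1i}^{\circ} z_i$, $\bar{X}_1^{\circ} = \sum_{i \in s_1} w_{1i}^{\circ} x_i$, $\bar{X}_2^{\circ} = \sum_{i \in s_2} w_{2i}^{\circ} x_i$, 
$\bar{Y}_2^{\circ} = \sum_{i \in s_2} w_{2i}^{\circ} y_i$, and 

$$\hat{\beta}_2 = \left(\sum_{i \in s_2} w_{2i}^{\circ} q_{2i} x_i y_i\right)\Big/\left(\sum_{i \in s_2} w_{2i}^{\circ} q_{2i} x_i^2\right).$$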

Hidiroglou and Sarndal (1995, 1998) and Singh (2000) 
have considered the problem of estimating the variance of 
the calibrated estimator $\bar{Y}^{c}$ in (1.6) by using a design-based 
approach. In a more general context, Rao and Sitter (1995) 
and Sitter (1997) have pointed out that under simple random 
sampling without replacement (SRSWOR), a jackknife 
technique can be used to estimate the variances of the ratio 
and regression estimators for a population mean. These 
authors have also reported that the use of the jackknife for 
estimating variance is more convenient and efficient than 
the traditional techniques based on estimates of moments. 

Of late, a number of authors have investigated the use of 
jackknife procedures for estimating variances (See Arnab 
and Singh 2006, Berger 2007, Berger and Skinner 2005, 
Chen and Shao 2001, and Kovar and Chen 1994). Fuller 
(1998), Kim, Navarro and Fuller (2000, 2006), Kim and 
Sitter (2003), and Kott and Stukel (1997) have suggested an 
approach for estimating the variance in two-stage sampling. 
Fuller (1998) and Kim and Sitter (2003) address the 
regression estimator. In particular, consider the generalized 
regression estimator of the population total 


= Law, 


due to Deville and Sarndal (1992). Following Kim et al. 
(2000, 2006), for each $k \in s$, specify the jackknife 
estimator of population total as 


(ie) 


and the chi-square distance between the design and 
calibration weights as 


0) ae al? —w 


ies, \k 


yew Niel. (wit )sae (ee) 


Minimizing (1.8) subject to the condition 


k k 
y ax, = > x, 


ies, \k ies, \k 


leads to jackknifed calibrated weights given by 


k Ten (ic) k k k k 
at ) = ws wy, ( + \(w'"q ) : ud) ws g ‘ ) 
4», 


y wx, — a We Wee 
i€sy\k ies, \k 

It would appear that Kim et al. (2006) readjusted these 

weights as 


Maes: 
a nwifek es; 


wiht, - ioe (Sies8.)- 


For such a readjustment, the estimator in (1.7) is equivalent 
to that of Rao and Sitter (1995). 

In the present paper, we consider a new jackknife 
technique to estimate the variance of the estimator $\bar{Y}^{c}$ under 
the two-phase setup by following Hidiroglou and Sarndal 
(1995, 1998). Similar to Kim et al. (2006), the estimator 
proposed by Rao and Sitter (1995) is shown to be a special 
case of the proposed method. However, our approach differs 
from that of Fuller (1998), Kim and Sitter (2003), and Kim et al. 
(2000, 2006) in that we consider calibration at both the first 
and second phases, thus allowing for the development of the 
technique for chain ratio and chain regression type 
estimators. We also investigate, via a simulation study, the 
efficiency of the jackknife estimators of variance relative to 
the usual estimators. 


2. Estimation of variance using jackknifing 


In what follows, we assume that a single stage design is 
employed at both of the two phases in the sampling process. 
Let $\bar{Y}^{c}(j)$ be a calibrated estimator of the population mean, 
$\bar{Y}$, obtained by dropping the $j$th unit from the sample $s_1$ of 
$m$ units. We prove in the Appendix that the jackknife 
estimator of the population mean in two-phase sampling can 
be written as 


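$$\bar{Y}^{c}(j) = \begin{cases} \bar{Y}_2^{\circ}(j) + \hat{\beta}_2(j)\{\bar{X}_1^{\circ}(j) - \bar{X}_2^{\circ}(j)\} + \hat{\beta}_2(j)\hat{\beta}_1(j)\{\bar{Z} - \bar{Z}_1^{\circ}(j)\} & \text{if } j \in s_2 \\ \bar{Y}_2^{\circ} + \hat{\beta}_2\{\bar{X}_1^{\circ}(j) - \bar{X}_2^{\circ}\} + \hat{\beta}_2\hat{\beta}_1(j)\{\bar{Z} - \bar{Z}_1^{\circ}(j)\} & \text{if } j \in (s_1 - s_2) \end{cases} \qquad (2.1)$$

where the quantity $\bar{Z}_1^{\circ}(j) = \bar{Z}_1^{\circ} + \{w_{1j}^{\circ}/(1 - w_{1j}^{\circ})\}\{\bar{Z}_1^{\circ} - z_j\}$, 
the terms $\bar{X}_1^{\circ}(j)$, $\bar{X}_2^{\circ}(j)$, and $\bar{Y}_2^{\circ}(j)$ are defined in 
an analogous manner, 
$\hat{\beta}_1(j) = \hat{\beta}_1 - \{q_{1j} w_{1j}^{\circ} z_j (x_j - \hat{\beta}_1 z_j)\}/\{\sum_{i \in s_1} q_{1i} w_{1i}^{\circ} z_i^2 - q_{1j} w_{1j}^{\circ} z_j^2\}$, and 
$\hat{\beta}_2(j) = \hat{\beta}_2 - \{q_{2j} w_{2j}^{\circ} x_j (y_j - \hat{\beta}_2 x_j)\}/\{\sum_{i \in s_2} q_{2i} w_{2i}^{\circ} x_i^2 - q_{2j} w_{2j}^{\circ} x_j^2\}$. 
The modified jackknife estimator of variance of $\bar{Y}^{c}$ is then given by 

$$v_{Jack}(\bar{Y}^{c}) = \{(m-1)/m\} \sum_{j \in s_1} \{\bar{Y}^{c}(j) - \bar{Y}^{c}\}^2. \qquad (2.2)$$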


We show in the appendix that this estimator is consistent. 
Note that we can write that 


&,(/)+ Boe, (/)+ B (a, (/) 


A 


¥(j)-¥* =) + B,8,(/) if jes, (2.3) 
B,€,(/) if fe(s,=s,) 
where the terms in (2.3) are given _by € NO age = 


(X) (j)-X?}- B,i){Z? (/)-Z}, & (= (Ye ()- ¥}- 
B,(A){X3(y) - X3} Staite (/){Z? Gy 2)5 4; UN 
(X? (7) — X3(j)} and 8,(j) ={X3(j)-X7}-B, IZ 
Zo Gye B, {Z - vie The e€,(/) term is analogous to the 
error term associated with the regression of the auxiliary 
variable x, on z,, for i € s,, while €,(j) is analogous to 
the error term associated with the regression of the study 
variable y, on both x, and z, simultaneously, for i € s,. 
Provided that 7 € s,, the d,(/) term reflects the difference 
in the jackknife first and second phase sample means for the 
variable X, while 5,(/) denotes an adjustment to d,(/) 
obtained by using information on the auxiliary variable Z. 
Using (2.3) in, (2.2), the jackknife estimator of variance 
of the estimator Y° is given by 
ervard OED) = {(m —1)/m} 


» (y+ VR()a2/) 


JES, JES, 


+ 85 > 8,(/) {5,(/) + 2e,()} 


JES} 


+ 28, > &,(/)e.V) 


JES, 


+ 28, DIB. (/) (/){e(/) + 8.(/)} 


+B mei} 


JES, 


(2.4) 


Note that the expression given in (2.4) is exact. It can be 
used to estimate the variance of several estimators available 
in the literature. 


3. Special cases 


In the next section, we demonstrate that the estimators 
proposed by Rao and Sitter (1995), Sitter (1997), Raj 
(1965), Srivenkataramana and Tracy (1989), Chand (1975), 
and Ahmed (1997) can be viewed as special cases of the 
proposed technique. 


Case 3.1: Rao and Sitter (1995) 


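If $\bar{X}_1^{c} = \bar{X}_1^{\circ}$ (no first-phase calibration is made) and 
$q_{2i} = 1/x_i$, then the calibrated estimator of $\bar{Y}$ becomes 

$$\bar{Y}^{c} = \left(\sum_{i \in s_2} w_{2i}^{\circ} y_i\right)\left(\sum_{i \in s_1} w_{1i}^{\circ} x_i\right)\Big/\left(\sum_{i \in s_2} w_{2i}^{\circ} x_i\right).$$

If the first-phase sample $s_1$ is selected according to 
SRSWOR such that the first-phase design weights are given 
by $d_{1i} = N/m$, and the second-phase sample $s_2$ is selected 
from $s_1$ by SRSWOR such that $d_{2i} = m/n$, then the 
calibrated estimator of the population mean becomes 

$$\bar{Y}_{RS}^{c} = \bar{y}\,(\bar{x}'/\bar{x}) \qquad (3.1)$$

where $\bar{y} = \sum_{i \in s_2} y_i/n$, $\bar{x} = \sum_{i \in s_2} x_i/n$, and $\bar{x}' = \sum_{i \in s_1} x_i/m$. 
The jackknife mechanism in (2.1) becomes 

$$\bar{Y}_{RS}^{c}(j) = \begin{cases} \dfrac{(n\bar{y} - y_j)(m\bar{x}' - x_j)}{(n\bar{x} - x_j)(m-1)} & \text{if } j \in s_2 \\[2ex] \dfrac{\bar{y}}{\bar{x}}\,\dfrac{(m\bar{x}' - x_j)}{(m-1)} & \text{if } j \in (s_1 - s_2). \end{cases} \qquad (3.2)$$

Setting $\hat{R} = \bar{y}/\bar{x}$, the difference between (3.2) and (3.1) 
can be written as 

$$\bar{Y}_{RS}^{c}(j) - \bar{Y}_{RS}^{c} = \begin{cases} -\dfrac{\bar{x}'(j)}{\bar{x}(j)}\,\dfrac{(y_j - \hat{R}x_j)}{(n-1)} - \hat{R}\,\dfrac{(x_j - \bar{x}')}{(m-1)} & \text{if } j \in s_2 \\[2ex] -\hat{R}\,\dfrac{(x_j - \bar{x}')}{(m-1)} & \text{if } j \in (s_1 - s_2). \end{cases} \qquad (3.3)$$

Expression (3.3) is exactly the same as reported by Rao 
and Sitter (1995). Assuming that $\bar{x}'(j)/\bar{x}(j) \approx \bar{x}'/\bar{x}$, then 
the approximate jackknife estimator of variance is given by 

$$v_{Jack}(\bar{Y}_{RS}^{c}) \approx \frac{m-1}{m}\left[\left(\frac{\bar{x}'}{\bar{x}}\right)^2 \sum_{j \in s_2} \frac{(y_j - \hat{R}x_j)^2}{(n-1)^2} + 2\hat{R}\,\frac{\bar{x}'}{\bar{x}} \sum_{j \in s_2} \frac{(y_j - \hat{R}x_j)(x_j - \bar{x}')}{(n-1)(m-1)} + \hat{R}^2 \sum_{j \in s_1} \frac{(x_j - \bar{x}')^2}{(m-1)^2}\right].$$

Thus, the Rao and Sitter (1995) estimator is a special case of 
the proposed jackknife technique. 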
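
The SRSWOR ratio case lends itself to a direct numerical check. The sketch below is an illustration with hypothetical function names and made-up data conventions (the first $n$ units of the first-phase sample are taken to form the second-phase sample). It computes the delete-one values of the two-phase ratio estimator by brute force, the jackknife variance as in (2.2), and the closed-form difference that (3.3) expresses, so the two can be compared unit by unit.

```python
def mean(v):
    return sum(v) / len(v)


def ratio_estimate(y2, x2, x1):
    """Two-phase ratio estimator: second-phase mean of y times the ratio of
    the first-phase to the second-phase mean of x."""
    return mean(y2) * mean(x1) / mean(x2)


def jackknife_ratio(y2, x2, x_rest):
    """Delete-one estimates over all of s1; the first n units of s1 form s2.

    x_rest holds the x values of the units in s1 that are not in s2.
    """
    x1 = x2 + x_rest
    n = len(x2)
    values = []
    for j in range(len(x1)):
        x1_j = x1[:j] + x1[j + 1:]
        if j < n:   # unit j is in s2: drop it from both phases
            values.append(ratio_estimate(y2[:j] + y2[j + 1:],
                                         x2[:j] + x2[j + 1:], x1_j))
        else:       # unit j is in s1 - s2: only the first-phase mean changes
            values.append(ratio_estimate(y2, x2, x1_j))
    return values


def jackknife_variance(y2, x2, x_rest):
    """(m - 1)/m times the sum of squared deviations of the delete-one values."""
    m = len(x2) + len(x_rest)
    full = ratio_estimate(y2, x2, x2 + x_rest)
    return (m - 1) / m * sum((v - full) ** 2
                             for v in jackknife_ratio(y2, x2, x_rest))


def closed_form_difference(y2, x2, x_rest, j):
    """Closed-form difference between the delete-one and full estimates."""
    x1 = x2 + x_rest
    n, m = len(x2), len(x1)
    ybar, xbar, xbar1 = mean(y2), mean(x2), mean(x1)
    r = ybar / xbar
    first_phase = -r * (x1[j] - xbar1) / (m - 1)
    if j >= n:
        return first_phase
    xbar_j = (n * xbar - x2[j]) / (n - 1)       # second-phase mean without j
    xbar1_j = (m * xbar1 - x1[j]) / (m - 1)     # first-phase mean without j
    return -(xbar1_j / xbar_j) * (y2[j] - r * x2[j]) / (n - 1) + first_phase
```

For any small data set, each brute-force delete-one value minus the full-sample estimate should match `closed_form_difference` up to floating-point error, which is the exactness property claimed for the jackknife representation.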


Case 3.2: Sitter (1997) 


In Case 3.1, if we consider $q_{2i} = 1$, then the calibrated 
estimator under SRSWOR becomes 

$$\bar{Y}^{c} = \bar{y} + b^{*}(\bar{x}' - \bar{x}), \qquad (3.4)$$

where $b^{*} = \sum_{j \in s_2} x_j y_j / \sum_{j \in s_2} x_j^2$ denotes an estimator of the 
regression coefficient that is slightly different from the 
one considered by Sitter (1997). The jackknife mechanism 
takes the form 


Vea fie 
ee oa peared 53), 


4 Se a 
n-\| Dame ee 


i€S5 


ee ee nx — a re (3.5) 
m—1 ae 


ik Fes; 55). 


If we set d, SW ie ie b'(x,- ~) a, = x,{x(j) - 
mG) LK. and ie =x; see ee K=("W= 1)s? + nx’, 
then the digeence Sateen (3.5) and (3.4) can be written as 
A 


PL = 

a 
plaid ai scioass JALE be if jes, 
a wl 

1+ : 

(l=) 
gat de a) =e) on 
aie it je(S,—s) 


which is similar to the expression reported by Sitter (1997). 


Case 3.3: Raj (1965) 


In order to consider this case, we assume that the initial 
sample $s_1$ of size $m$ is selected with replacement according 
to probabilities $p_i$ proportional to $z_i$, $i = 1, 2, \ldots, N$. 
Information on the auxiliary variable $X$ is collected on this 
first-phase sample, $s_1$. The second-phase sample, specified 
to be of size $n$, is a subsample of $s_1$ selected without 
replacement using equal probabilities. It is for $s_2$ that 
information on $Y$ is collected. Under this sampling scheme, 
$d_{1i} = 1/\pi_{1i} = 1/(mp_i)$ and $d_{2i} = m/n$. Thus, 
$w_{1i}^{\circ} = (1/p_i)/\sum_{i \in s_1}(1/p_i)$ and $w_{2i}^{\circ} = (1/p_i)/\sum_{i \in s_2}(1/p_i)$. Note also 
that for this scheme, $\bar{X}_1^{c} = \bar{X}_1^{\circ}$; thus no first-phase 
calibration is made. If $q_{2i} = 1/x_i$, then the calibrated estimator 
$\bar{Y}^{c}$ becomes 

$$\bar{Y}_{Raj}^{c} = \bar{Y}_2^{\circ}\,(\bar{X}_1^{\circ}/\bar{X}_2^{\circ}) \qquad (3.6)$$

where $\bar{Y}_2^{\circ} = \sum_{i \in s_2}(y_i/p_i)/\sum_{i \in s_2}(1/p_i)$, $\bar{X}_2^{\circ} = \sum_{i \in s_2}(x_i/p_i)/\sum_{i \in s_2}(1/p_i)$, 
and $\bar{X}_1^{\circ} = \sum_{i \in s_1}(x_i/p_i)/\sum_{i \in s_1}(1/p_i)$. Thus, 
alternatively $\bar{Y}_{Raj}^{c} = \{\sum_{i \in s_2}(y_i/p_i)\sum_{i \in s_1}(x_i/p_i)\}/\{\sum_{i \in s_2}(x_i/p_i)\sum_{i \in s_1}(1/p_i)\}$. 


Under the sampling scheme described above, the 
jackknife estimator of population mean is 


ATG eet er 
ae oC) 
Yaa (J) 4 (3.7) 
ve x) if. je(s,—s,) 
xX; 
where 
mG) = 
2%! P,) ual 2a lia L/P) 
SOn 9. Celexa al 
ies, Ul’ p,) TES) 
and $\bar{X}_2^{\circ}(j)$ and $\bar{Y}_2^{\circ}(j)$ are defined analogously. If 
$\hat{R} = \bar{Y}_2^{\circ}/\bar{X}_2^{\circ}$ and $w_{2j}^{\circ} = (1/p_j)/\sum_{i \in s_2}(1/p_i)$, the difference 
between (3.7) and (3.6) can easily be written as 


| gh ne a 
X?G) : 
x 2 a (y; — Rx;) 
X3(/) 
Rt A eA elles 
R{X? (J) - XP} if J €(s,—5,). 


Thus, the jackknife estimator of variance of the estimator 
Yeai iS given by 


Vee) a 
DTS Gye = fp 
JES) i ') 
eS Bed wiles «es 
JES, 


es okra ne 
~28', we, LD (y, — Re, RC- FP}. 


A 


JES) Xsi(y) 


Following Rao and Sitter (1995), if we assume $\bar{x}_1^*(j)/\bar{x}_2^*(j) \approx \bar{x}_1^*/\bar{x}_2^*$, then the jackknife estimator of variance of $\hat{\bar{Y}}^c_{Raj}$ takes the form


Viack e Raj ) = 


ae ey ve) Cpmeans 


JES3 


+ RY (XP) — XV 


JES, 


~ DRExe XS” ws (y;—Rx,;) OG) ay | 


JES2 


Case 3.4: Srivenkataramana and Tracy (1989) 


In order to consider this case, as in Raj (1965), we assume that the initial sample $s_1$ of size m is selected with replacement according to probabilities proportional to $z_i$. However, the subsample, $s_2$, of n units is now selected with replacement using probabilities proportional to $x_i/z_i$. As a result, $w_{1i}^* = (1/z_i)/\sum_{i \in s_1}(1/z_i)$ and $w_{2i}^* = (1/x_i)/\sum_{i \in s_2}(1/x_i)$. Similar to Raj (1965), no first-phase calibration is made; thus $\bar{X}_1^c = \bar{x}_1^*$. Hence, if $q_{2i} = 1/x_i$, then the calibrated estimator $\hat{\bar{Y}}^c$ is

$$ \hat{\bar{Y}}^c = \bar{y}_2^* \, (\bar{x}_1^* / \bar{x}_2^*), \tag{3.8} $$

where $\bar{y}_2^* = \sum_{i \in s_2}(y_i/x_i) / \sum_{i \in s_2}(1/x_i)$, $\bar{x}_2^* = n / \sum_{i \in s_2}(1/x_i)$, and $\bar{x}_1^* = \sum_{i \in s_1}(x_i/z_i) / \sum_{i \in s_1}(1/z_i)$. Thus, alternatively,

$$ \hat{\bar{Y}}^c = \frac{\sum_{i \in s_2}(y_i/x_i) \, \sum_{i \in s_1}(x_i/z_i)}{n \sum_{i \in s_1}(1/z_i)}. $$

Under the sampling scheme described above, the 
jackknife estimator of population mean is 


aut. OCI Oh af Fes, 

Ce ee (3.9) 
V(X ()/ Xz} if fE(s, —5) 

where 

‘i Gres) STG) 

yo  _ ies, a eee eal Jeon s he ; : 

hg px) "diy -1) da)! 


The terms Y 3 (j) and x ; (j) are defined similarly; that is 


a ae ea) 


while ¥ ’ (/) can be written as 


Gz) ¥0y/x,) 


xe : ple se cae 
1) = px) "yds -1) S0m) 


Tee nr) eand wa (1) 0.)) De, (ii.): the 
difference between (3.9) and (3.8) is given by 


VO) iat 
ee 
EG 


Rie Gia 


(y,-Rx,) + REX?) -X?} if jes, 


if 7 S(S=5>). 


Following Rao and Sitter (1995), if we assume $\bar{x}_1^*(j)/\bar{x}_2^*(j) \approx \bar{x}_1^*/\bar{x}_2^*$, then the jackknife estimator of variance of $\hat{\bar{Y}}^c$ takes the form


Ve Ce, )* 


| OM EONe i OmND PD 2 
a /X5} » (,) (y; —Rx;) 


Jes, 


+ RY EX? (f) — XPV 


JES 


— 2REX? 1X2} Y- we (vy — Rx, )(X2()-X} | 


JESy 


Case 3.5: Chand (1975) 


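In order to consider this case, the first-phase sample $s_1$ of size m is selected using SRSWOR, and both auxiliary variables Z and X are observed on the chosen units. The subsample, $s_2$, of n units is also selected using SRSWOR. Obviously, $d_{1i} = N/m$ and $d_{2i} = m/n$, so that $w_{1i}^* = 1/m$ and $w_{2i}^* = 1/n$. If $q_{1i} = 1/z_i$ and $q_{2i} = 1/x_i$, then the calibrated estimator $\hat{\bar{Y}}^c$ becomes

$$ \hat{\bar{Y}}^c_{Ch} = \bar{y} \, (\bar{x}'/\bar{x}) \, (\bar{Z}/\bar{z}'), \tag{3.10} $$

where $\bar{y} = \sum_{i \in s_2} y_i/n$, $\bar{x} = \sum_{i \in s_2} x_i/n$, $\bar{x}' = \sum_{i \in s_1} x_i/m$, and $\bar{z}' = \sum_{i \in s_1} z_i/m$. The jackknife estimator of $\bar{Y}$ is

$$ \hat{\bar{Y}}^c_{Ch}(j) = \begin{cases} \dfrac{\bar{y}(j)}{\bar{x}(j)} \, \dfrac{\bar{x}'(j)}{\bar{z}'(j)} \, \bar{Z} & \text{if } j \in s_2 \\[2ex] \dfrac{\bar{y}}{\bar{x}} \, \dfrac{\bar{x}'(j)}{\bar{z}'(j)} \, \bar{Z} & \text{if } j \in (s_1 - s_2) \end{cases} \tag{3.11} $$

where $\bar{y}(j) = (n\bar{y} - y_j)/(n - 1)$, $\bar{x}(j) = (n\bar{x} - x_j)/(n - 1)$, $\bar{x}'(j) = (m\bar{x}' - x_j)/(m - 1)$, and finally $\bar{z}'(j) = (m\bar{z}' - z_j)/(m - 1)$. If we let $\hat{R}_1 = \bar{x}'/\bar{z}'$ (an estimator of $R_1 = \bar{X}/\bar{Z}$) and $\hat{R}_2 = \bar{y}/\bar{x}$ (an estimator of $R_2 = \bar{Y}/\bar{X}$), and similarly, let $\hat{R}_1(j) = \bar{x}'(j)/\bar{z}'(j)$ and $\hat{R}_2(j) = \bar{y}(j)/\bar{x}(j)$, the difference between (3.11) and (3.10) can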


ede Yon 7 


(+ Re(DtR()d(/)+R5(y) if jes, 

(3.12) 
eG) if jE(s,—s,) 
where we can ee e G3. 12) that “e5(7) = 
RG) {%Y)-X}-R DR, Z 
xX}, 5,(/) = eee —¥'(/)} — RG){Z-7Z'()} -RZ - 
Z'}, = ae that the fem ¢/C/)= ae x'}— 
R (J) —Z}. _Thus the jackknife estimator of variance 
of the heat Ye is given by 


>( 10 se 
(D-2}, dD= 8G) 


Vinca | ey 7 


(om Din) Sea) + > R3(j)d5 (Jj) 


JES JES) 


+R> > 8,(/){55(/) + 2e,(/)} 


JES, 


+2R, >) 8,(7)e,() 


JES> 


+2R, > Rds (AE, (+8, ()} 


JES 


+P 


JES, 
Case 3.6: Ahmed (1997)


Consider the same sample design as in Case 3.5. Rather 
than q,,=1/z, and qg,,=1/x, as in Chand (1975), we set q,,= 
q>,= 1, and q,,=1/x,, then the calibrated estimator reduces 
to 

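$$ \hat{\bar{Y}}^c = \bar{y} + \hat{b}_2(\bar{x}' - \bar{x}) + \hat{b}_1 \hat{b}_2 (\bar{Z} - \bar{z}'), \tag{3.13} $$

where $\hat{b}_2 = \sum_{i \in s_2} x_i y_i / \sum_{i \in s_2} x_i^2$ and $\hat{b}_1 = \sum_{i \in s_1} z_i x_i / \sum_{i \in s_1} z_i^2$. Note that (3.13) is a chain regression type estimator similar to Ahmed (1997). Letting $\hat{b}_2(j) = \hat{b}_2 + \{x_j(y_j - \hat{b}_2 x_j)/(x_j^2 - \sum_{i \in s_2} x_i^2)\}$ and $\hat{b}_1(j) = \hat{b}_1 + \{z_j(x_j - \hat{b}_1 z_j)/(z_j^2 - \sum_{i \in s_1} z_i^2)\}$, after jackknifing the estimator becomes

$$ \hat{\bar{Y}}^c(j) = \begin{cases} \bar{y}(j) + \hat{b}_2(j)\{\bar{x}'(j) - \bar{x}(j)\} + \hat{b}_1(j)\hat{b}_2(j)\{\bar{Z} - \bar{z}'(j)\} & \text{if } j \in s_2 \\[1ex] \bar{y} + \hat{b}_2\{\bar{x}'(j) - \bar{x}\} + \hat{b}_1(j)\hat{b}_2\{\bar{Z} - \bar{z}'(j)\} & \text{if } j \in (s_1 - s_2) \end{cases} \tag{3.14} $$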


The difference between (3.14) and (3.13) can be written as 


Yoh AO k= Your = 

(9) 5,8, 7) +05) a,(/) +b, oC et fess 
sis 36, (/)+5,(/,(/)+5;8,() if jes, ae 
b,€,(J) if jes; = 55) 
where we can write in (3.15) that ¢«, PN vy) - 
¥}—b, (JR) — ¥} -B (A) GEG) - 23, a) = 


{X'G)-¥}, 8,7) = &()-¥'()} — & DZ -7'D} - 

b, {Z —Z’}, and finally that the term ¢,(/) = {x’'() - 
¥'} —b (/){Z'(/) - Z}. Thus the jackknife estimator of 
variance of the estimator Vite is given by 


Vey es) a 
{(m — 1)/m* P e3(j) + be {b,(j)}"d; (/) 


JES> JES> 


+ {b,(J)}" & 8 (DS, (/)+2¢,()} 


JES 


+ 2b, > &,(/)e(/) 


+ 2b; D1 (Ade) +82()} 
+ {b,}° > i) 


JES, 


4. Simulation study 


In this section, we present the results of simulation 
studies designed to investigate the performance of the 
proposed jackknife procedure for estimating the variance of 
four of the two-phase estimators of population mean 
presented in Section 3. Specifically, we consider the Rao
and Sitter (1995) ratio-type estimator, the Sitter (1997)
regression-type estimator, the Chand (1975) chain ratio-type
estimator, and the Ahmed (1997) chain regression-type
estimator. Initially, we describe and report the results of
simulations that were conducted for the Rao and Sitter
(1995) and Sitter (1997) estimators. This is followed by a
discussion and summary of similar simulations on the 
Chand (1975) and Ahmed (1997) estimators. Unlike the 
case for the ratio and regression estimators, since complete 
information on a second auxiliary variable Z is required for 
the entire population in order to apply the two chain 
estimators, the simulations that were conducted for these 
two estimators are somewhat more complicated than those 
performed for the ratio and regression estimators. 


4.1 Simulation study: Rao and Sitter (1995) and 
Sitter (1997) 

For purposes of the first set of simulations, we assume that a first-phase sample of m units is selected from a population of N units, and only the auxiliary variable X is measured. From the first-phase sample of m units, we then select a second-phase sample of n units by SRSWOR in which both the study variable, Y, and the auxiliary variable, X, are measured.

We began by creating a population of N units consisting of $(X_i, Y_i)$ pairs using the model

$$ Y_i = \beta X_i + X_i^g \varepsilon_i, $$

with $\beta = 10$. Initially, we set g = 0 and N = 500. For each i, i = 1, ..., N, we generated $X_i$ from a gamma distribution with a shape parameter of 3.1 and a scale parameter of one, and $\varepsilon_i$ from a standard normal. From the resulting population of $(X_i, Y_i)$ pairs, we selected 1,000 first-phase samples of m = 100 units, and from each of these samples, we selected 10,000 second-phase samples of n = 20 units.
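The population model and the nested draws described above can be sketched as follows. This is only an illustration: the function names, the seeds and the use of SRSWOR at the first phase are assumptions made for the example (the text states SRSWOR explicitly only for the second phase), and only the Python standard library is used.

```python
import random

def make_population(N, beta=10.0, g=0.0, shape=3.1, seed=1):
    """Generate (X_i, Y_i), i = 1..N, with Y_i = beta*X_i + X_i**g * eps_i.

    X_i ~ gamma(shape, scale 1) and eps_i ~ N(0, 1), as in the text.
    The seed is a hypothetical choice so that the example is reproducible.
    """
    rng = random.Random(seed)
    xs = [rng.gammavariate(shape, 1.0) for _ in range(N)]
    ys = [beta * x + (x ** g) * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def draw_two_phase(N, m, n, rng):
    """Draw unit labels for a two-phase sample.

    First phase: m labels from range(N) without replacement (assumed SRSWOR).
    Second phase: n of those m labels, again without replacement.
    """
    s1 = rng.sample(range(N), m)
    s2 = rng.sample(s1, n)
    return s1, s2

if __name__ == "__main__":
    xs, ys = make_population(500, g=0.0)
    s1, s2 = draw_two_phase(500, 100, 20, random.Random(2))
    print(len(s1), len(s2))
```

Repeating `draw_two_phase` with a fixed first-phase sample and fresh second-phase draws would mimic the nested design of 10,000 second-phase samples within each of 1,000 first-phase samples.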

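Under the sampling scheme used here, Rao and Sitter (1995) proposed the ratio estimator

$$ \hat{\bar{Y}}_{RS} = \bar{y} \, (\bar{x}'/\bar{x}), \tag{4.1} $$

which has approximate variance

$$ V(\hat{\bar{Y}}_{RS}) = (n^{-1} - m^{-1}) S_d^2 + (m^{-1} - N^{-1}) S_y^2, $$

where

$$ S_d^2 = (N - 1)^{-1} \sum_{i=1}^{N} [(Y_i - \bar{Y}) - R(X_i - \bar{X})]^2 $$

and

$$ S_y^2 = (N - 1)^{-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2, $$

with $\bar{Y} = \sum_{i=1}^{N} Y_i/N$, $\bar{X} = \sum_{i=1}^{N} X_i/N$, and $R = \bar{Y}/\bar{X}$. For the t-th second-phase sample (t = 1, ..., 10,000) drawn from the k-th first-phase sample (k = 1, ..., 1,000), we computed the usual estimator of variance

$$ \hat{V}_U[\hat{\bar{Y}}_{RS}(t|k)] = \left(\frac{1}{n} - \frac{1}{m}\right) s^2_{d(t|k)} + \left(\frac{1}{m} - \frac{1}{N}\right) s^2_{y(t|k)}, \tag{4.2} $$

where the sample variances are

$$ s^2_{d(t|k)} = (n - 1)^{-1} \sum_{i=1}^{n} [(y_{i(t|k)} - \bar{y}_{(t|k)}) - \hat{r}_{(t|k)}(x_{i(t|k)} - \bar{x}_{(t|k)})]^2 $$

and

$$ s^2_{y(t|k)} = (n - 1)^{-1} \sum_{i=1}^{n} (y_{i(t|k)} - \bar{y}_{(t|k)})^2, $$

with $\bar{y}_{(t|k)} = \sum_{i=1}^{n} y_{i(t|k)}/n$ and $\bar{x}_{(t|k)} = \sum_{i=1}^{n} x_{i(t|k)}/n$. In addition, $\hat{r}_{(t|k)} = \bar{y}_{(t|k)}/\bar{x}_{(t|k)}$. We also computed the jackknife estimator of variance

$$ \hat{V}_{Jack}[\hat{\bar{Y}}_{RS}(t|k)] = \frac{m - 1}{m} \sum_{j \in s_1} \left\{ \bar{y}_{(t|k)}(j) \frac{\bar{x}'_{(k)}(j)}{\bar{x}_{(t|k)}(j)} - \bar{y}_{(t|k)} \frac{\bar{x}'_{(k)}}{\bar{x}_{(t|k)}} \right\}^2, \tag{4.3} $$

and the ratio of estimated variances

$$ RV(t|k) = \hat{V}_U[\hat{\bar{Y}}_{RS}(t|k)] \, / \, \hat{V}_{Jack}[\hat{\bar{Y}}_{RS}(t|k)]. $$

We then computed the average of the RV(t|k) over all k and t, which is given by

$$ RV = \frac{1}{10,000,000} \sum_{k=1}^{1,000} \sum_{t=1}^{10,000} RV(t|k). $$

We also determined empirical estimates of the biases in (4.2) and (4.3) by computing

$$ EBU = \frac{1}{10,000,000} \sum_{k=1}^{1,000} \sum_{t=1}^{10,000} \{ \hat{V}_U[\hat{\bar{Y}}_{RS}(t|k)] - V(\hat{\bar{Y}}_{RS}) \} $$

and

$$ EBJ = \frac{1}{10,000,000} \sum_{k=1}^{1,000} \sum_{t=1}^{10,000} \{ \hat{V}_{Jack}[\hat{\bar{Y}}_{RS}(t|k)] - V(\hat{\bar{Y}}_{RS}) \}. $$

Note that the estimator given in (4.2) is unbiased. Finally, we calculated the relative efficiency of the usual estimator of variance to the jackknife estimator according to

$$ RE = \frac{\sum_{k=1}^{1,000} \sum_{t=1}^{10,000} \{ \hat{V}_U[\hat{\bar{Y}}_{RS}(t|k)] - V(\hat{\bar{Y}}_{RS}) \}^2}{\sum_{k=1}^{1,000} \sum_{t=1}^{10,000} \{ \hat{V}_{Jack}[\hat{\bar{Y}}_{RS}(t|k)] - V(\hat{\bar{Y}}_{RS}) \}^2}. $$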
Using the same generated population of N = 500, we
repeated the simulation; however we used m =400 and 
n = 80 instead. We then created four additional populations 
of size N =500 using g =0.5, 1.0, 1.5, and 2.0. For each 
of these four populations, we repeated the two simulations 
described above where in the first simulation, m = 100 with 
n = 20, and in the second simulation, m =400 and n = 80. 
Finally, to study the effect of population size, we then 
repeated all the simulations based on the different values of 
g, m, and n when N =500 for three additional values of 
N, namely 5,000, 50,000, and 500,000. The results 
obtained for RV, EBU, EBJ, and RE for each of these 
simulations are presented in Table 1. 
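For a single two-phase sample, the usual estimator (4.2) and the jackknife estimator (4.3) of the variance of the ratio estimator (4.1) might be computed along the following lines, with a second helper for the summary measures RV, EBU, EBJ and RE. This is a sketch under stated assumptions rather than the authors' code: the function names, the boolean flag `in_s2` marking second-phase units, and the convention that `y` is aligned with the first-phase units are all hypothetical choices for the example.

```python
def ratio_variance_estimates(x1, y, in_s2, N):
    """Return (estimate, usual variance estimate, jackknife variance estimate).

    x1    : x values of the m first-phase units
    y     : y values aligned with x1 (used only where in_s2 is True)
    in_s2 : True for units that are also in the second-phase sample
    N     : population size
    """
    m = len(x1)
    x2 = [x for x, f in zip(x1, in_s2) if f]
    y2 = [v for v, f in zip(y, in_s2) if f]
    n = len(x2)
    xb1, xb2, yb = sum(x1) / m, sum(x2) / n, sum(y2) / n
    r = yb / xb2
    est = r * xb1                                   # ratio estimator (4.1)

    s2d = sum(((yi - yb) - r * (xi - xb2)) ** 2
              for xi, yi in zip(x2, y2)) / (n - 1)
    s2y = sum((yi - yb) ** 2 for yi in y2) / (n - 1)
    v_usual = (1 / n - 1 / m) * s2d + (1 / m - 1 / N) * s2y   # (4.2)

    ss = 0.0
    for xj, yj, f in zip(x1, y, in_s2):
        xb1_j = (m * xb1 - xj) / (m - 1)            # first-phase mean without unit j
        # ybar(j)/xbar(j); the common factor 1/(n-1) cancels in the ratio
        r_j = (n * yb - yj) / (n * xb2 - xj) if f else r
        ss += (r_j * xb1_j - est) ** 2
    v_jack = (m - 1) / m * ss                       # (4.3)
    return est, v_usual, v_jack

def summarize(v_usual, v_jack, v_true):
    """RV, EBU, EBJ and RE computed over a list of replicates."""
    reps = len(v_usual)
    rv = sum(u / j for u, j in zip(v_usual, v_jack)) / reps
    ebu = sum(u - v_true for u in v_usual) / reps
    ebj = sum(j - v_true for j in v_jack) / reps
    re = (sum((u - v_true) ** 2 for u in v_usual)
          / sum((j - v_true) ** 2 for j in v_jack))
    return rv, ebu, ebj, re
```

In a full replication, `v_true` would be the approximate variance computed from the whole population, and the two lists would hold one entry per (k, t) pair.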

The results for RE in Table 1 suggest that as the population size N tends to infinity (as considered by Rao and Sitter 1995), the jackknife estimator of variance remains more efficient than the usual unbiased estimator of variance. It is also the case for very large N that the values for RV tend to one. However, considering the cases where N = 500, if the population size is relatively small, not only are the values for RV noticeably smaller than one, but the jackknife estimator of variance seems to be significantly biased. In addition, the jackknife estimator appears to be much less efficient than the usual unbiased estimator of variance, especially when m and n are large. Of note here is the fact that Rao and Sitter (1995) and Sitter (1997) state that it is not clear how to fix the finite population correction factors in the jackknife estimator of variance in two-phase sampling. This would seem to be an area where further research could be fruitful, since it would appear that when the population size is small, it might be worthwhile to adjust the finite population correction factors instead of directly applying the jackknife technique according to the approach proposed here. Note that Kim et al. (2006) have incorporated a finite population correction factor in a special case.


Table 1
Comparison of the jackknife and usual estimators of variance of the ratio estimator of the population mean when β = 10 and the auxiliary variable, X, follows a gamma distribution with a shape parameter of 3.1 and a scale parameter of one

      N      m    n     g       RV      EBU      EBJ           RE
    500    100   20   0.0    0.801    0.006    0.542  [illegible]
                      0.5    0.800    0.010    0.579        1.310
                      1.0    0.805   -0.071    0.561        1.267
                      1.5    0.816   -0.358    0.575        1.149
                      2.0    0.840   -0.720  [illegible]    0.935
  5,000    100   20   0.0    0.979   -0.028    0.042        4.015
                      0.5    0.976    0.007    0.096        3.709
                      1.0    0.965    0.023    0.172        3.210
                      1.5    0.936   -0.073    0.337        1.308
                      2.0    0.916   -1.103    0.493        0.967
 50,000    100   20   0.0    1.001   -0.002    0.003        6.241
                      0.5    0.998    0.107    0.126        4.936
                      1.0    0.981    0.101    0.196        2.965
                      1.5    0.937   -0.211    0.167        1.558
                      2.0    0.924   -0.355    0.940        1.005
500,000    100   20   0.0    1.001   -0.057   -0.054        4.730
                      0.5    0.999    0.014    0.024        4.669
                      1.0    0.993    0.185    0.229        3.223
                      1.5    0.940   -0.235    0.122        1.420
                      2.0    0.907   -1.054    0.530        1.009
    500    400   80   0.0    0.214    0.000    0.520        0.002
                      0.5    0.237   -0.001    0.523        0.002
                      1.0    0.320    0.000    0.544        0.006
                      1.5    0.530   -0.001    0.616        0.066
                      2.0    0.733   -0.012    1.091        0.452
  5,000    400   80   0.0    0.919   -0.003    0.061        2.687
                      0.5    0.920   -0.001    0.064        2.505
                      1.0    0.922    0.003    0.077        2.058
                      1.5    0.930   -0.028    0.077  [illegible]
                      2.0    0.940   -0.089    0.184        1.088
 50,000    400   80   0.0    0.991   -0.008   -0.001        4.550
                      0.5    0.991    0.004    0.012        5.276
                      1.0    0.991    0.000    0.009        4.163
                      1.5    0.980   -0.024   -0.001  [illegible]
                      2.0    0.967   -0.171   -0.040        1.099
500,000    400   80   0.0    1.000    0.009    0.009        5.501
                      0.5    0.999    0.001    0.001        5.180
                      1.0    0.993   -0.001    0.006        3.852
                      1.5    0.992   -0.022   -0.018        1.809
                      2.0    0.971   -0.179   -0.079        1.136
We also considered the Sitter (1997) regression estimator, and repeated the entire simulation study that was performed using the ratio estimator in (4.1). Specifically, rather than (4.1), we made use of the estimator

$$ \hat{\bar{Y}}_{S} = \bar{y} + \hat{b}(\bar{x}' - \bar{x}), \tag{4.4} $$

which has approximate variance

$$ V(\hat{\bar{Y}}_{S}) = (n^{-1} - m^{-1}) S_e^2 + (m^{-1} - N^{-1}) S_y^2, \tag{4.5} $$

where

$$ S_e^2 = (N - 1)^{-1} \sum_{i=1}^{N} [(Y_i - \bar{Y}) - B(X_i - \bar{X})]^2 $$

with

$$ B = \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) \Big/ \sum_{i=1}^{N} (X_i - \bar{X})^2. $$

For each different combination of N, g, m, and n used in the simulation study, we computed

$$ \hat{V}_U[\hat{\bar{Y}}_{S}(t|k)] = (n^{-1} - m^{-1}) s^2_{e(t|k)} + (m^{-1} - N^{-1}) s^2_{y(t|k)}, \tag{4.6} $$

for the t-th second-phase sample drawn from the k-th first-phase sample, where the sample variance

$$ s^2_{e(t|k)} = (n - 1)^{-1} \sum_{i=1}^{n} [(y_{i(t|k)} - \bar{y}_{(t|k)}) - \hat{b}_{(t|k)}(x_{i(t|k)} - \bar{x}_{(t|k)})]^2. $$

We also computed the jackknife estimator of variance

$$ \hat{V}_{Jack}[\hat{\bar{Y}}_{S}(t|k)] = \frac{m - 1}{m} \sum_{j \in s_1} \left[ \bar{y}(j) + \hat{b}(j)\{\bar{x}'(j) - \bar{x}(j)\} - \{\bar{y} + \hat{b}(\bar{x}' - \bar{x})\} \right]^2. \tag{4.7} $$

For each different combination of N, g, m, and n, equations (4.5) through (4.7) were used to compute values for RV, EBU, EBJ, and RE analogous to those given in Table 1 for the estimator in (4.1). The results obtained were extremely similar to those for the ratio estimator.


4.2 Simulation study: Chand (1975) and 
Ahmed (1997) 


For purposes of the second set of simulations, we now 
assume that when the first-phase sample of m units is 
selected from the population of size N, information on two 
auxiliary variables X and Z is collected. When the 
second-phase sample of size n is selected from the first- 
phase sample, the study variable Y is measured, along with 
the two auxiliary variables X and Z. Note also that the 
auxiliary variable Z is assumed to be known for the entire population.

We began by creating a population of N = 500 units of $(X_i, Z_i, Y_i)$ observations using

$$ Y_i = \beta_1 X_i + \beta_2 Z_i + \varepsilon_i, $$

with $\beta_1 = 3.5$ and $\beta_2 = 2.5$. For each i, i = 1, ..., N, we generated $X_i$ from a gamma distribution with a shape parameter of 2.2 and a scale parameter of one, $Z_i$ from a gamma distribution with a shape parameter of 0.1 and a scale parameter of one, and $\varepsilon_i$ from a standard normal. From the resulting population of $(X_i, Z_i, Y_i)$ observations, we selected 1,000 first-phase samples of m = 100 units, and from each of these samples, we selected 10,000 second-phase samples of n = 20 units.

Following Chand (1975), a chain ratio estimator under two-phase sampling is given by

$$ \hat{\bar{Y}}_{Ch} = \bar{y} \, (\bar{x}'/\bar{x}) \, (\bar{Z}/\bar{z}'), $$

which has approximate variance

$$ V(\hat{\bar{Y}}_{Ch}) = (n^{-1} - m^{-1}) S_{d_1}^2 + (m^{-1} - N^{-1}) S_{d_2}^2, \tag{4.8} $$

where

$$ S_{d_1}^2 = (N - 1)^{-1} \sum_{i=1}^{N} [(Y_i - \bar{Y}) - R_{yx}(X_i - \bar{X})]^2 $$

and

$$ S_{d_2}^2 = (N - 1)^{-1} \sum_{i=1}^{N} [(Y_i - \bar{Y}) - R_{yz}(Z_i - \bar{Z})]^2, $$

with $R_{yx} = \bar{Y}/\bar{X}$ and $R_{yz} = \bar{Y}/\bar{Z}$. In the simulation study, we computed

$$ \hat{V}_U[\hat{\bar{Y}}_{Ch}(t|k)] = (n^{-1} - m^{-1}) s^2_{d_1(t|k)} + (m^{-1} - N^{-1}) s^2_{d_2(t|k)}, \tag{4.9} $$

for the t-th second-phase sample drawn from the k-th first-phase sample, where the sample variances are

$$ s^2_{d_1(t|k)} = (n - 1)^{-1} \sum_{i=1}^{n} [(y_{i(t|k)} - \bar{y}_{(t|k)}) - \hat{r}_{yx(t|k)}(x_{i(t|k)} - \bar{x}_{(t|k)})]^2 $$

with $\hat{r}_{yx(t|k)} = \bar{y}_{(t|k)}/\bar{x}_{(t|k)}$, and

$$ s^2_{d_2(t|k)} = (n - 1)^{-1} \sum_{i=1}^{n} [(y_{i(t|k)} - \bar{y}_{(t|k)}) - \hat{r}_{yz(t|k)}(z_{i(t|k)} - \bar{z}_{(t|k)})]^2 $$

with $\hat{r}_{yz(t|k)} = \bar{y}_{(t|k)}/\bar{z}_{(t|k)}$. We also computed the jackknife estimator of variance

$$ \hat{V}_{Jack}[\hat{\bar{Y}}_{Ch}(t|k)] = \frac{m - 1}{m} \sum_{j \in s_1} \left\{ \frac{\bar{y}_{(t|k)}(j)}{\bar{x}_{(t|k)}(j)} \, \frac{\bar{x}'_{(k)}(j)}{\bar{z}'_{(k)}(j)} \, \bar{Z} - \frac{\bar{y}_{(t|k)}}{\bar{x}_{(t|k)}} \, \frac{\bar{x}'_{(k)}}{\bar{z}'_{(k)}} \, \bar{Z} \right\}^2, \tag{4.10} $$

with the leave-one-out quantities defined as in (3.11).

Using the same generated population of N = 500, we repeated the simulation; however we used m = 400 and n = 80 instead. We then created three additional populations of size N = 500 using $\beta_1 = 0.5$ with $\beta_2 = 0.5$, $\beta_1 = 3.5$ with $\beta_2 = 0.5$, and $\beta_1 = 0.5$ with $\beta_2 = 2.5$. For each of these three populations, we repeated the two simulations described above where in the first simulation, m = 100 with n = 20, and in the second simulation, m = 400 and n = 80. Finally, to study the effect of population size, we then repeated all the simulations based on the different values of $\beta_1$, $\beta_2$, m, and n when N = 500 for three additional values of N, namely 5,000, 50,000, and 500,000. For each different combination of N, $\beta_1$, $\beta_2$, m, and n, equations (4.8) through (4.10) were used to compute values for RV, EBU, EBJ, and RE analogous to those given in Table 1 for the estimator in (4.1). The results are provided in Table 2.

Generally speaking, the findings based on the results in Table 2 are similar to those arrived at for the estimators based on (4.1) and (4.4). In particular, the jackknife estimator of variance is more efficient than the usual estimator when the population size is sufficiently large. However, also of note is the fact that this efficiency seems to be related to the magnitude of the regression coefficients $\beta_1$ and $\beta_2$; that is, the jackknife estimator appears to achieve relatively greater efficiency for cases where the coefficient associated with the auxiliary variable X is large relative to the analogous coefficient linked to Z.
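The jackknife computation for the chain ratio estimator can be sketched in the same way as for the ratio estimator, using the leave-one-out means given in Case 3.5. As before, this is a hedged illustration: the function name, the `in_s2` flag and the alignment of `y` with the first-phase units are assumptions made for the example, not part of the original study.

```python
def chain_ratio_jackknife(x1, z1, y, in_s2, Z_bar):
    """Return (chain ratio estimate, jackknife variance estimate).

    x1, z1 : x and z values of the m first-phase units
    y      : y values aligned with x1 (used only where in_s2 is True)
    in_s2  : True for units that are also in the second-phase sample
    Z_bar  : known population mean of Z
    """
    m = len(x1)
    x2 = [x for x, f in zip(x1, in_s2) if f]
    y2 = [v for v, f in zip(y, in_s2) if f]
    n = len(x2)
    xb1, zb1 = sum(x1) / m, sum(z1) / m
    xb2, yb = sum(x2) / n, sum(y2) / n
    est = (yb / xb2) * (xb1 / zb1) * Z_bar

    ss = 0.0
    for xj, zj, yj, f in zip(x1, z1, y, in_s2):
        # xbar'(j)/zbar'(j): the common factor 1/(m-1) cancels
        r1_j = (m * xb1 - xj) / (m * zb1 - zj)
        # ybar(j)/xbar(j) changes only when unit j is in the second phase
        r2_j = (n * yb - yj) / (n * xb2 - xj) if f else yb / xb2
        ss += (r2_j * r1_j * Z_bar - est) ** 2
    return est, (m - 1) / m * ss
```

When y is exactly proportional to x and x exactly proportional to z, every delete-one estimate equals the full-sample estimate, so the jackknife variance is zero; this gives a simple sanity check.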


Table 2
Comparison of the jackknife and usual estimators of variance of the chain ratio estimator of the population mean where the auxiliary variable, X, follows a gamma distribution with a shape parameter of 2.2 and a scale parameter of one, and the auxiliary variable, Z, follows a gamma distribution with a shape parameter of 0.1 and a scale parameter of one

   m    n    β1    β2         N       RV      EBU      EBJ           RE
 100   20   3.5   2.5       500    0.769    0.000    0.027        1.063
                          5,000    0.831   -0.012    0.020        2.282
                         50,000    0.818   -0.006    0.028        1.785
                        500,000    0.852    0.001    0.036  [illegible]
 100   20   0.5   0.5       500    0.911   -0.001    0.004        0.791
                          5,000    0.943   -0.001    0.002        0.888
                         50,000    0.948    0.000    0.003        0.896
                        500,000    0.946    0.000    0.003        0.899
 100   20   3.5   0.5       500    0.845   -0.001    0.015        1.674
                          5,000    0.932   -0.011    0.000        3.632
                         50,000    0.947   -0.005    0.004  [illegible]
                        500,000    0.947    0.000    0.010        3.637
 100   20   0.5   2.5       500    0.866   -0.001    0.009        0.668
                          5,000    0.858   -0.003    0.008        0.775
                         50,000    0.855   -0.001    0.010        0.670
                        500,000    0.855    0.000    0.012        0.697
 400   80   3.5   2.5       500    0.540    0.000    0.013        0.044
                          5,000    0.780   -0.001    0.009        1.346
                         50,000    0.819    0.000    0.008        1.878
                        500,000    0.810   -0.001    0.006        1.953
 400   80   0.5   0.5       500    0.817    0.000    0.003        0.254
                          5,000    0.956    0.000    0.000        0.885
                         50,000    0.973    0.000    0.001        0.946
                        500,000    0.973    0.000    0.000        0.963
 400   80   3.5   0.5       500    0.579    0.000    0.010        0.041
                          5,000    0.907   -0.001    0.003        3.158
                         50,000    0.954    0.000    0.002        3.845
                        500,000    0.950   -0.001    0.001        4.853
 400   80   0.5   2.5       500    0.787    0.000    0.004        0.222
                          5,000    0.862    0.000    0.002        0.570
                         50,000    0.873    0.000    0.003        0.698
                        500,000    0.875    0.000    0.002        0.595
Finally, an analogous simulation study was performed using the regression estimator of Ahmed (1997). However, the populations were created using $\beta_1 = 10$ with $\beta_2 = 0.5$, $\beta_1 = 100$ with $\beta_2 = 0.5$, $\beta_1 = 0.5$ with $\beta_2 = 10$, and $\beta_1 = 10$ with $\beta_2 = 10$. As before when the estimators of Rao and Sitter (1995), Sitter (1997), and Chand (1975) were considered, provided that the population is sufficiently large, the jackknife estimator of variance seems to be more efficient than the usual estimator.


5. Conclusion and discussion 


In this paper, the problem of estimating the variance of
various estimators of the population mean in two-phase
sampling has been considered by jackknifing the well-known
two-phase calibrated weights of Hidiroglou and Särndal
(1995, 1998). Simulation studies based on ratio, regression,
and chain-type estimators suggest that, provided the
population size is large enough and the first- and second-phase
samples are relatively small, the jackknife estimator
of variance is more efficient than the usual estimator of
variance, regardless of the estimator for the population mean
that is considered. For small populations, it might be
worthwhile to adjust the finite population correction factors
instead of directly applying the jackknife technique. This is
an area where further research could be conducted.
Appendix
Derivation of the jackknife estimator in (2.1) 


In this part of the appendix, we prove (2.1) for the 
jackknifed estimator of the population mean in two phase- 
sampling. First, note that B, (Ga B, +t,,@, and B, Gi) = 
B, +4 e);, where ¢, , = prea Go ee ai 
Q,=x;- Bz, t= a Wo Xj! (Qa jW3 jXj — Lies, Voi 2} we 


and e,, = y =, x; We also fee ey = Zz i 

h, (Z; ap! Ru X? +h, (XP xj), On es 
ek Xe a» and Y,’(j) = ye + Oe - Vi); where 
hj = WwW 1d —w,) and h,, = w;,/( — w;,). 
Using these results, for $j \in s_2$, we have
¥°(j) = ¥y + B,(Xy - ¥7) + BB,(Z- 2?) 
a hy (Y - Y)t+h, Og - ne 
Ce a (xs ny} 

+ t,,¢,B,(Z- Uh Peers Ca z.) 


4: Date On( Ze Zs 


27 22; wee; h,(Z? - a) 
=3 B, B, Aa Ze)! 

Similarly, for $j \in (s_1 - s_2)$, we have

¥°(j) = + B(x? - X3) + B,B,(Z- 27) 
“yb h,, (X —x,)+4,e,B)(Z- Zey 

—1,,6,Byh, AE 25) 

+B: 6,(Z-Z?) -h,(Z? — z,)}. 

Thus for $j \in s_2$,

Fea yeeuy= x) 

— 6,4) BIZ?) - Z} 

+ B,LEX?(/) — X7} - BZ?) - Z3) 


¥°() = ¥° = iG) Zs ‘a \ 


A — aa 


~ B,()&Z = Z?()} - BAZ - 2731, 
and for $j \in (s_1 - s_2)$,
Yj) -¥° = BX) — X73 - BZ?) - ZH, 
which proves (2.1). 
Consistency of the estimator of variance in (2.2) 


In this part of the appendix, we prove that the estimator
$\hat{V}_{Jack}(\hat{\bar{Y}}^c)$ in (2.2) is consistent. First, note that the
variance of the estimator $\hat{\bar{Y}}^c$ defined in (1.6) can be
approximated as:


V(¥°) = V(¥’) + B[V X?) + VX2) — 2Cov(X?, X3)] 
+ Bi BV (Z;) 
MoO [Con oN Covll x) 
2988, Cov(rnZ} 


— 28, BSI [Cov(X?, Zy = Cov x Ze )). 
If it is assumed that $\hat{\beta}_1(j) \approx \hat{\beta}_1$, $\hat{\beta}_2(j) \approx \hat{\beta}_2$, and, similar to Rao and Sitter (1995), that $\bar{x}_1^*(j)/\bar{x}_2^*(j) \approx \bar{x}_1^*/\bar{x}_2^*$, it is quite straightforward to show that


TIN O-F Pe VIBW-HP +B DUIW)-AF 


Jes JES, JES, 


OVI es Cee 


JES 


2) Gell | 


JES2 
— 263 > [X°G)=XP I Q=*s1 
VES 


=26,8; VE G)—Y ZG) 
— 26,83 SLXVG)- XZ?) -Z7 


JES, 


Since the ten terms on the right hand side of this equation
for $\sum_{j \in s_1} \{\hat{\bar{Y}}^c(j) - \hat{\bar{Y}}^c\}^2$ are the consistent estimators of the
analogous ten terms in the equation above for $V(\hat{\bar{Y}}^c)$, it
may be concluded that the jackknife estimator of variance in
(2.2) is consistent.
A comparison of sample set restriction procedures


Jason C. Legg and Cindy L. Yu ' 


Abstract 


For many designs, there is a nonzero probability of selecting a sample that provides poor estimates for known quantities. 
Stratified random sampling reduces the set of such possible samples by fixing the sample size within each stratum. 
However, undesirable samples are still possible with stratification. Rejective sampling removes poor performing samples by 
only retaining a sample if specified functions of sample estimates are within a tolerance of known values. The resulting 
samples are often said to be balanced on the function of the variables used in the rejection procedure. We provide 
modifications to the rejection procedure of Fuller (2009a) that allow more flexibility on the rejection rules. Through 
simulation, we compare estimation properties of a rejective sampling procedure to those of cube sampling. 


Key Words: Rejection sampling; Cube sampling; Stratification; Balanced sampling. 


1. Introduction 


A common practice in survey sampling is to utilize 
known population information about auxiliary variables to 
improve estimators of means and totals of characteristics of 
interest. When population control means or totals for an 
auxiliary variable are known, regression and other calibration estimators are often utilized. Let $(x_i, y_i, p_i)$, i = 1, 2, ..., N, be a sequence of real vectors, where each $x_i$ is a vector of fixed dimension, and let a sample A be selected from $F_N = [(x_1, y_1, p_1), \ldots, (x_N, y_N, p_N)]$ using a sample design with inclusion probabilities $p_i$ and joint inclusion probabilities $p_{ij}$. Suppose the population mean of $x_i$, $\bar{x}_N$, is known. Consider the regression estimator of the population mean of the form

$$ \bar{y}_{reg} = \bar{z}_N \hat{\beta}, \tag{1} $$

where $z_i$ contains design variables and $x_i$, $\bar{z}_N$ is the population mean of $z_i$, and $\hat{\beta}$ is a regression coefficient estimator. For many designs, $\hat{\beta}$ of the form

$$ \hat{\beta} = \left( \sum_{i \in A} z_i' \phi_i p_i^{-2} z_i \right)^{-1} \sum_{i \in A} z_i' \phi_i p_i^{-2} y_i, \tag{2} $$

where $\phi_i$ are constants determined by the design, will be asymptotically efficient. Some examples of $\phi_i$ choices are $\phi_i = (1 - p_i)$ for Poisson sampling and, for stratified random sampling, $\phi_i = (N_h - 1)^{-1}(N_h - n_h)$ for element i in stratum h. If we assume there is a vector d such that

$$ \phi_i p_i^{-2} z_i d = p_i^{-1} \tag{3} $$

for all i, then estimator (1) is design consistent (Fuller 2002). The regression coefficient estimator (2) converges to

$$ \beta_N = \left( \sum_{i=1}^{N} z_i' \phi_i p_i^{-1} z_i \right)^{-1} \sum_{i=1}^{N} z_i' \phi_i p_i^{-1} y_i. $$


As an example of applying equation (3), suppose we plan to select a Poisson sample and want to regress on a single covariate $x_i$, through the origin. If we add $(1 - p_i)^{-1} p_i$ into $z_i$ to make $z_i = (x_i, [1 - p_i]^{-1} p_i)$, then (1) will be design consistent for $\bar{y}_N$ since (3) is satisfied by setting $d' = (0, 1)$. If we further assume that a column of ones is in the column space of the regression variables $z_i$, then for these $\phi_i$ values, estimator (1) nearly attains the minimum asymptotic variance for design consistent regression estimators under certain regularity conditions (Rao 1994). An alternative approach to constructing a regression estimator is to start with a design consistent estimator, such as the generalized regression estimator of Särndal (1980), and determine the best coefficient given that form of the estimator. Starting with a design consistent form removes the need to satisfy (3). Condition (3) allows estimator (1) to be expressed in the form of a generalized regression estimator (Fuller 2009b, pages 116-117).

When auxiliary information is known at the unit level, the auxiliary information can also be incorporated into the sample design. For example, in one classic case, the model

$$ y_i = \beta_0 + \beta_1 x_i + x_i \varepsilon_i \tag{4} $$

with $\varepsilon_i \sim ind(0, \sigma^2)$ and $cov(\varepsilon_i, x_i) = 0$ is assumed for the population $F_N$. From Isaki and Fuller (1982), the optimal inclusion probabilities for the regression estimator are those that are proportional to the square root of the design variances, i.e., $p_i \propto x_i$ in this case. A possible sampling procedure is Poisson sampling with inclusion probabilities

$$ p_i = n_N \left( \sum_{j=1}^{N} x_j \right)^{-1} x_i, \tag{5} $$

where $n_N = \sum_{i=1}^{N} p_i$ is a specified target sample size. A second common design when model (4) is assumed is to stratify the population based on x. Strata are determined by setting the boundaries such that the sums of the sorted $x_i$ values in each stratum are approximately equal. An equal number of units in each stratum are selected. This stratification design has inclusion probabilities close to (5), and was shown to have an anticipated variance close to the best purposive sample model variance in the two-per-stratum case (Fuller 1981).
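The Poisson design with inclusion probabilities (5) can be illustrated with a short sketch. The function names and the error raised when some $p_i$ would exceed one are assumptions made for this example; the text does not say how such units would be handled.

```python
import random

def poisson_sample(x, n_target, rng):
    """Poisson sampling with p_i = n_target * x_i / sum(x), as in (5).

    Each unit is included independently with probability p_i, so the
    realized sample size is random with expectation n_target.
    """
    total = sum(x)
    p = [n_target * xi / total for xi in x]
    if max(p) > 1.0:
        # Hypothetical handling: the text does not cover this situation.
        raise ValueError("some p_i exceed 1; reduce n_target")
    sample = [i for i, pi in enumerate(p) if rng.random() < pi]
    return sample, p

def ht_mean(values, sample, p):
    """Horvitz-Thompson estimator of the population mean."""
    return sum(values[i] / p[i] for i in sample) / len(values)

if __name__ == "__main__":
    rng = random.Random(5)
    x = [1.0 + (i % 7) for i in range(300)]
    sample, p = poisson_sample(x, 30, rng)
    print(len(sample), ht_mean(x, sample, p))
```

Because $p_i$ is proportional to $x_i$, every sampled unit contributes the same amount to the Horvitz-Thompson total of x, so that estimator varies only through the realized sample size.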

Another way to incorporate information from an auxiliary variable into the design is balancing. A sample A is balanced for variable z if

$$ \bar{z}_{HT} = N^{-1} \sum_{i \in A} p_i^{-1} z_i = N^{-1} \sum_{i=1}^{N} z_i = \bar{z}_N. \tag{6} $$

A design is balanced for z if every sample with positive probability is balanced for z. Balancing can be thought of as calibration by design. To illustrate the effect of balancing, consider an equal inclusion probability design and $z_i = (1, x_i)$. The conditional prediction variance of $\bar{y}_{reg}$ under model (4) is
V(Veog— Yn |%> Xap) = EV Gyr Fy) | 5 ar 

+ (Xy —Xur yp VB, x, Xr Mh (7) 


where $u_i = x_i \varepsilon_i$. For a balanced design, the second term in
(7) is 0, which suggests we might improve the estimator by 
balancing on x. In practice, a combination of balancing and 
calibration will often outperform either technique used 
alone. 

Balanced sample designs have some additional practical 
value. For many designs, there is a nonzero probability of 
selecting a sample that contains undesirable auxiliary 
variable values. For example, an undesirable sample could 
be a sample with insufficient sample allocation for domains 
or a sample with a large number of extreme values of 
auxiliary variables. Although stratified designs reduce the 
set of such possible samples by fixing the sample size 
within each stratum, undesirable samples could still be 
possible. For example, some stratified samples might have 
some negative weights from using regression estimators. 
Balancing can remove poor performing samples by only 
retaining samples with estimates close to known quantities 
and with only positive weights for regression estimators. 

Balanced sampling was proposed by Royall and 
Cumberland (1981) as a way to reduce model bias from 
incorrectly specified polynomial superpopulation models. 
Valliant, Dorfman and Royall (2000) discuss the implica- 
tions of balancing from a prediction approach to sampling. 
Deville and Tillé (2004) investigated methods of selecting
balanced samples within the design-based framework 
described above. See also Tillé (2006 Chapter 8) for a 
detailed treatment of balancing. In practice, finding a 
perfectly balanced design may not be possible. Very tight 
balancing can lead to a design with some extreme joint 
inclusion probabilities, including zero inclusion proba- 
bilities. Therefore, partial balancing is done in practice. 

In this paper, we compare design properties through 
simulation studies of two balancing procedures, the rejective 
sampling of Fuller (2009a) and the cube sampling of Tillé 
(2006). We also provide modifications to Fuller’s rejective 
sampling procedure that allow for more flexibility in 
balancing. In Section 2, the rejective sampling and the cube 
sampling are described. Properties of the inclusion proba- 
bilities of the two balancing procedures are compared in 
Section 3. In Section 4, some simulation results using 
balanced samples are presented. In Section 5, we provide 
adjustments to the rejective procedure. Concluding remarks 
are made in Section 6. 


2. Balanced sampling procedures 


Rejection sampling involves discarding any sample that 
does not meet a specified balancing tolerance. Fuller 
(2009a) presents one condition for rejecting a sample and 
Royall and Herson (1973) give another. In Fuller’s 
procedure with the balancing variable vector z, a sample is selected under a specified initial design and retained if

$$ (\bar{z}_{HT} - \bar{z}_N) \{V(\bar{z}_{HT} \mid F_N)\}^{-1} (\bar{z}_{HT} - \bar{z}_N)' < \gamma \tag{8} $$

for some constant $\gamma > 0$, where $\bar{z}_{HT}$ is the Horvitz-Thompson mean estimator for variable z, $F_N$ is the given finite population,

$$ V(\bar{z}_{HT} \mid F_N) = N^{-2} \sum_{i=1}^{N} \sum_{j=1}^{N} (p_{ij} - p_i p_j) \, p_i^{-1} z_i' \, p_j^{-1} z_j, $$

$p_i$ is the inclusion probability for unit i and $p_{ij}$ is the joint inclusion probability of unit i and unit j under the initial design. Otherwise, the sample is rejected, a new sample is selected under the initial design, and condition (8) is checked for the new sample. If the original design has a central limit theorem, the left side of (8) is asymptotically a $\chi^2$ random variable with degrees of freedom equal to the number of auxiliary variables. An approximate rejection rate can be set using the quantiles of a $\chi^2$ distribution for $\gamma$.
Choice of a rejection rate will depend on objectives of each 
individual survey. Low rejection rates may not reduce the 
variance by a large amount, but provide sufficient comfort 
to a researcher that a very poor sample will not be selected. 
On the other hand, high rejection rates could provide large 
reductions in the variance, but the resulting samples could 
have insufficient sample size to accommodate unplanned
domain analysis. For example, if a researcher decides to 
conduct domain analysis on the tail of the distribution of a 
balancing variable, the joint inclusion probabilities could be 
small leading to few units in the domain for many samples. 
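
The rejection loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: a single balancing variable, simple random sampling as the initial design (so the Horvitz-Thompson mean is the sample mean), and $\gamma$ taken from the $\chi^2(1)$ approximation. The function name `rejective_srs`, the seed and the population values are hypothetical and do not come from Fuller (2009a) or from any package.

```python
import numpy as np
from statistics import NormalDist

def rejective_srs(x, n, rejection_rate, rng, max_tries=100_000):
    """Draw SRS samples of size n until condition (8) holds for one variable x."""
    N = len(x)
    xbar_N = x.mean()
    # Design variance of the sample mean under SRS without replacement.
    v = (1.0 - n / N) * x.var(ddof=1) / n
    # With one balancing variable the quadratic form is roughly chi-square(1),
    # so gamma is the (1 - rejection_rate) quantile of a squared standard normal.
    accept = 1.0 - rejection_rate
    gamma = NormalDist().inv_cdf(0.5 + accept / 2.0) ** 2
    for tries in range(1, max_tries + 1):
        s = rng.choice(N, size=n, replace=False)
        q = (x[s].mean() - xbar_N) ** 2 / v
        if q < gamma:
            return s, q, gamma, tries
    raise RuntimeError("no sample accepted")

rng = np.random.default_rng(1)
x = rng.standard_normal(100)          # hypothetical population of x values
sample, q, gamma, tries = rejective_srs(x, n=20, rejection_rate=0.90, rng=rng)
```

With several balancing variables the scalar variance would be replaced by the full matrix $V(\bar z_{HT} \mid F_N)$ and $\gamma$ by a quantile of the corresponding $\chi^2$ distribution.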

The cube method was developed by Tillé and Deville 
and is described in Tillé (2006). The cube method attempts 
to select a balanced sample with predetermined first-order 
inclusion probabilities. If the first-order inclusion vector 
does not lead to a balanced design, an additional step of 
minimizing a cost constraint is used. Unlike the rejection 
procedure, higher order initial inclusion probabilities are not 
prespecified. The cost minimization step maintains the 
specified initial first-order inclusion probabilities. 

As a way to understand the cube procedure, Tillé (2006) 
describes sampling geometrically. The set of all possible 
samples is defined to be the set of vectors for vertices of an 
N dimensional unit cube. For example, if N =3, the 
vertex (0,1,1) denotes a sample containing units two and 
three. Using the balancing equation (6) and desired $p_i$ for
$i = 1, \ldots, N$, a balancing plane is created. Any sample
where the balancing plane intersects a vertex of the unit $N$
dimensional cube is a balanced sample. The design is 
balanced if every point of intersection between the 
balancing plane and the unit cube is a vertex of the unit 
cube. The cube sampling procedure begins by selecting a 
vector on the balancing plane, then a random walk from the 
initial point to an edge of the unit cube is done. Tillé refers 
to the random walk step as the flight phase. If the edge point 
at the end of the random walk is a vertex of the unit cube, 
the sample is selected. Otherwise, a cost minimization 
procedure is used to convert the fractional components of 
the edge vector to integers. The integer components of the 
edge vector are not changed in the cost minimization step. 
Tillé refers to the cost minimization step as the landing 
phase. Rejection sampling with high rejection rates 
produces results similar to cube sampling. 
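
To make the geometric description more concrete, the following Python sketch implements a simplified version of the flight phase only. It assumes the balancing constraints are written as $\sum_i \pi_i z_i / p_i = \sum_i z_i$ and moves along null-space directions of the constraint matrix until at most as many fractional components remain as there are balancing variables. The function name `flight_phase` and all inputs are hypothetical, the landing phase is omitted, and this is not the implementation in the R sampling library.

```python
import numpy as np

def flight_phase(p, Z, rng, tol=1e-9):
    """Random walk on the balancing plane until few fractional units remain."""
    pi = p.astype(float).copy()
    A = (Z / p[:, None]).T                 # k x N matrix with columns z_i / p_i
    k = A.shape[0]
    while True:
        frac = np.flatnonzero((pi > tol) & (pi < 1 - tol))
        if frac.size <= k:                 # no guaranteed direction left in the plane
            break
        # The last right-singular vector lies in the null space of A[:, frac]
        # because that matrix has more columns than rows.
        u = np.linalg.svd(A[:, frac])[2][-1]
        pos, neg = u > 0, u < 0
        cur = pi[frac]
        # Largest forward (lam1) and backward (lam2) steps that stay in the unit cube.
        lam1 = min(np.min((1 - cur[pos]) / u[pos], initial=np.inf),
                   np.min(cur[neg] / -u[neg], initial=np.inf))
        lam2 = min(np.min(cur[pos] / u[pos], initial=np.inf),
                   np.min((1 - cur[neg]) / -u[neg], initial=np.inf))
        # These selection probabilities keep the expected value of pi equal to p.
        step = lam1 if rng.random() < lam2 / (lam1 + lam2) else -lam2
        pi[frac] = cur + step * u
        pi[np.abs(pi) < tol] = 0.0         # snap rounding noise to the cube faces
        pi[np.abs(pi - 1) < tol] = 1.0
    return pi

rng = np.random.default_rng(7)
N, n = 30, 6
x = rng.standard_normal(N)                 # hypothetical auxiliary variable
p = np.full(N, n / N)
Z = np.column_stack([p, x])                # balance on sample size and on x
pi_star = flight_phase(p, Z, rng)
```

Components of `pi_star` equal to 1 are selected and those equal to 0 are excluded; any remaining fractional components are what the landing phase would have to resolve.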

Other procedures besides rejection and cube sampling 
can be used to obtain nearly balanced samples. For example, 
stratification with boundaries determined by the x variables 
can also introduce some balancing effects to samples (Fuller 
1981). Deciding the number of variables to use in the 
rejection and cube sampling procedures is essentially the 
same process as deciding how many variables to include in 
a regression estimator. 

Software has been developed for selecting cube samples. 
For rejection sampling, standard software packages can be 
used to select a sample and compute (8). A loop needs to be 
written to complete the procedure. Programs for selecting 
cube samples have been written for SAS and R. See
Rousseau and Tardieu (2004) for SAS and Matei and Tillé
(2005) for R, and details of the procedures implemented are 
addressed in Deville and Tillé (2004). The R program 
available in the sampling library was used in the simulations 
in this paper. Because the cost minimization step of cube 
sampling is computationally intensive for more than 20 
balancing variables, a variable suppression step is recom- 
mended for the landing phase in the programs. 


3. Inclusion probabilities 


Let $\pi_i$ be the first-order inclusion probability for unit $i$
and $\pi_{ij}$ be the joint inclusion probability for units $i$ and $j$
under a balanced design. Both rejective and cube sampling
require initial first-order inclusion probabilities as inputs.
For rejection sampling, the resulting first-order inclusion
probabilities differ from the initial values:
units closer to the population mean will have a slightly
higher inclusion probability than units far from the mean.
Cube sampling maintains the first-order inclusion probabilities
from the initial specification. That is, for cube
sampling $\pi_i = p_i$. Although for rejection sampling $\pi_i \neq p_i$
in general, the estimators considered will still use $p_i$ rather
than $\pi_i$.

To illustrate differences between initial and final 
inclusion probabilities, samples of size 20 from a population 
of 100 units were simulated. The population of x -values 
was generated as random variables from a standard normal 
distribution. The rejection procedure used simple random 
sampling as the initial design and balanced on x. The cube 
sample procedure used a balancing vector of $z_i = (p_i, x_i)'$,
where $p_i = 20/100$ for all $i$. Including $p_i$ in the
balancing vector for cube sampling controls the
sample size, so that the resulting design is
comparable to the rejection sampling simulation, which used
simple random sampling as the initial design. First-order
inclusion probabilities were estimated using a Monte
Carlo simulation of size 100,000 (Figure 1). The curve was
obtained by nonparametric fitting. An approximate 90% 
rejection rate was used for the rejection sampling. From 
rejection sampling theory, first-order inclusion probabilities 
are approximately a quadratic function of the distance
$x_i - \bar x_N$ for an equal probability initial sample design (Fuller
2009a). The plot suggests that all first-order inclusion
probabilities are 0.2 for the cube sample design. As
expected, Figure 1 indicates that the cube method maintains the
specified first-order inclusion probabilities, but the rejective
method does not. As a result, the Horvitz-Thompson estimator using
the initial inclusion probabilities ($p_i$) and the rejective
samples is biased.

Figure 1 Simulated first-order inclusion probabilities. The
balancing variable for the rejective method is
$z_i = x_i$, and for the cube method is $z_i = (p_i, x_i)'$,
where $p_i = 20/100$
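
A small Monte Carlo experiment of the kind summarized in Figure 1 can be sketched as follows for the rejective design. The seed, the number of replicates (40,000 rather than the 100,000 used in the paper) and the $\chi^2(1)$ cutoff are assumptions made for illustration, so the numbers produced are not those plotted in the figure.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2024)
N, n, reps = 100, 20, 40_000
x = rng.standard_normal(N)                 # hypothetical population of x values

# Draw `reps` simple random samples at once: the first n positions of a
# random permutation of the unit labels form an SRS without replacement.
samples = rng.random((reps, N)).argsort(axis=1)[:, :n]

# Quadratic form of condition (8) for one balancing variable under SRS.
v = (1 - n / N) * x.var(ddof=1) / n
q = (x[samples].mean(axis=1) - x.mean()) ** 2 / v

# Keep roughly 10% of the samples (90% rejection) via a chi-square(1) cutoff.
gamma = NormalDist().inv_cdf(0.55) ** 2
accepted = samples[q < gamma]

# Relative frequency of each unit among accepted samples estimates pi_i.
pi_hat = np.bincount(accepted.ravel(), minlength=N) / len(accepted)
```

Plotting `pi_hat` against `x` should show the pattern described in the text: estimated inclusion probabilities near 0.2 on average, somewhat higher for units close to the population mean and lower in the tails.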


The joint inclusion probabilities for the rejection
sampling procedure differ from those of the initial design. A
pair of units $i$ and $j$ is likely to have a high joint
inclusion probability if $x_i + x_j - 2\bar x_N$ is close to zero for an
equal probability initial sample design. The joint inclusion
probabilities were estimated from simulated samples of size
20 from 100 (Figure 2). The joint inclusion probability for
simple random sampling is 0.038. The rejection sampling
joint inclusion probabilities are approximately a quadratic
function of $x_i + x_j$. The plot of cube sampling joint
inclusion probabilities against $x_i + x_j$ appears to have
sharper angles than that of the rejection joint inclusion
probabilities. High joint inclusion probabilities for the cube design
are associated with pairs of units that are on the far opposite
sides of $\bar x_N$. That is, for the same value of $x_i + x_j$, those
pairs with a large value of $|x_i| + |x_j|$ have a large
probability of inclusion (Figure 3).

Figure 2 Simulated second-order inclusion probabilities. The
balancing variable for the rejective method is
$z_i = x_i$, and for the cube method is $z_i = (p_i, x_i)'$,
where $p_i = 20/100$


The Horvitz-Thompson estimator using the initial 
inclusion probabilities under rejection sampling has an 
$O(n^{-1})$ bias while the Horvitz-Thompson estimator under
cube sampling is unbiased. The standard Horvitz-Thompson 
variance estimator is biased for both procedures. Using 
Monte Carlo methods, the inclusion probabilities can be 
estimated so that nearly unbiased Horvitz-Thompson 
estimators can be used. However, for a large population, 
simulating enough samples to give a precise estimate of the 
joint inclusion probability for each pair of units is 
impractical. An alternative approach to variance estimation 
is to use a regression estimator and the variance estimator 
for the regression estimator. This is intuitively appealing 
because balancing is similar to regression through design. 
When the regression estimator is used, its bias
under both cube and rejective methods
is of the same order. For rejective sampling, Fuller (2009a)
gives conditions for the consistency of the variance
estimator for the regression estimator. For cube sampling,
Deville and Tillé (2005) and Tillé (2006) suggest that the
variance estimator for a regression estimator furnishes a
good approximation to the variance of the Horvitz- 
Thompson estimator. The variance estimators proposed by 
Deville and Tillé (2005) perform well when the joint 
inclusion probabilities of the resulting cube design are 
approximately equal to joint inclusion probabilities from a 
Poisson design. In the simulation studies of Section 4, the 
variance estimators proposed in Fuller (2009a) and Deville 
and Tillé (2005) are evaluated. 
Figure 3 Simulated second-order inclusion probabilities with
absolute sums of $x$. The balancing variable for the
rejective method is $z_i = x_i$, and for the cube
method is $z_i = (p_i, x_i)'$, where $p_i = 20/100$




4. Simulation of the regression estimator 


A population of size 100 was generated from the model 
Vee Xa 0.55.02 teens, (9) 


$e_i \sim \text{iid } N(0, 0.4)$, where the $x_i$ are fixed values in the
range of 0 to 4 (Figure 4). Seventy-two of the $x$ values
were randomly simulated values less than 1.15 from a
standard exponential distribution. The remaining 28 values,
ranging from 0.18 to 4.0, were deterministically added to
form the data set of $x$. The fixed $x$ values were selected to
be fairly right skewed so that stratifying the population on $x$
with approximately equal within-stratum sums of sorted $x_i$
produces both large and small strata.
The population was held fixed after initial selection. Model
(9) contains a quadratic term, and was picked to simulate
performance of the design and estimator strategy when
model (4) was assumed in design and estimation.

Figure 4 Simulation population under model (9)


We consider Poisson sampling and two-per-stratum 
stratified random sampling as initial designs. Strata were 
determined by setting the boundaries so that the
within-stratum sum of sorted $x_i$ was roughly equal for all strata.
The sample size was set to 20, and ten strata were formed.
The stratum sizes were 35, 15, 11, 9, 8, 7, 5, 4, 3, and 3. The
rejection procedure used a stratified two-per-stratum sample
selection with equal inclusion probabilities within a stratum.
The stratum boundaries were chosen this way so that the
inclusion probability of unit $i$ is closely proportional to $x_i$,
which is the optimal inclusion probability under model (9)
(Isaki and Fuller 1982). Such a stratified design is also
partially balanced on $x$ through a standard design: balance
in the stratified random sampling design is achieved using a
step function to approximate a line. The stratified random
sample design is intended to illustrate how much more one 
can benefit from additional balancing. Two units per stratum 
were drawn in order to obtain the maximum number of 
strata while still permitting unbiased variance estimation. 
Fuller (1981) showed that, in the two-per-stratum case, this 
stratified design has an anticipated variance close to the best 
purposive model variance under (4). Initial inclusion 
probabilities for the Poisson design with expected sample 
size 20 were set to the initial inclusion probabilities of the 
stratified design. 
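
The stratum construction described above (boundaries chosen so that the within-stratum sums of sorted $x_i$ are roughly equal) can be sketched as follows. The function name `equal_sum_strata` and the simulated size measure are hypothetical, the rule assumes positive $x$ values, and it does not by itself guarantee at least two units per stratum, so it should be read as one possible rule rather than the authors' exact construction.

```python
import numpy as np

def equal_sum_strata(x, H):
    """Assign units to H strata with roughly equal within-stratum sums of sorted x."""
    order = np.argsort(x)
    xs = x[order]
    # Midpoint of each unit's contribution to the running total of sorted x.
    mid = np.cumsum(xs) - xs / 2.0
    labels_sorted = np.minimum((H * mid / xs.sum()).astype(int), H - 1)
    labels = np.empty(len(x), dtype=int)
    labels[order] = labels_sorted
    return labels

rng = np.random.default_rng(3)
x = rng.exponential(size=100)              # hypothetical right-skewed size measure
strata = equal_sum_strata(x, H=10)
sizes = np.bincount(strata, minlength=10)              # units per stratum
sums = np.bincount(strata, weights=x, minlength=10)    # sum of x per stratum
```

With two units drawn per stratum, the inclusion probability in stratum $h$ is $2/N_h$, which is roughly proportional to the average $x$ in that stratum because the stratum sums are nearly equal.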

The regression estimator considered in this paper is in the 
form of (1) with $\hat\beta$ defined in (2). The regression variable
$z$ is a vector of auxiliary variables that contains design
variables and $x$. For the Poisson designs, we used
$z_i = (1, p_i, x_i, (1 - p_i)^{-1} p_i)'$ as the vector of balancing
variables and as the regression variable vector. The first
variable provides control for population size, the second
variable is a control for sample size, the third variable
provides balance on $x$, and the fourth variable guarantees
that the regression estimator is design consistent. See
condition (3) for the design consistency of $\bar y_{reg}$, and set
$d = (0, 0, 0, 1)'$. For two-per-stratum stratified samples, the
vector of balancing variables is $(x_i, I_{1i}, I_{2i}, \ldots, I_{10,i})'$ for cube
sampling, where $I_{hi}$ are the stratum indicator variables
defined as

$$I_{hi} = \begin{cases} 1 & \text{unit } i \text{ in stratum } h \\ 0 & \text{otherwise} \end{cases}$$

for $h = 1, 2, \ldots, 10$. Only the $x$ variable is included in the
rejective balancing procedure since the sample from this
initial design is automatically balanced on the stratum
indicator variables. The regression variable vector for both
balancing procedures is $z_i = (x_i, I_{1i}, \ldots, I_{10,i})'$.

For the initial designs, the variance estimators for $\bar y_{reg}$
are the variance estimators of the mean of $e_i = y_i - z_i'\beta_N$
calculated with $\hat e_i$, where $\hat e_i = y_i - z_i'\hat\beta$. For Poisson
sampling, the variance estimator is


V (Preg) =(n-s)! nZpy M:! i © pe 
icA 
(=p )ie? 2 Mazy, (10) 
where 
M,, =N*)) 2, 7? - 2) z, 


icA 
and s is the number of variables in z. Derivation of (10) is 
provided in the appendix. 


For stratified random sampling with two-per-stratum, the
variance estimator for $\bar y_{reg}$ is

$A_h$ is the sample set in stratum $h$, $W_h = n_h/N_h$, $\phi_i =
(N_h - 1)^{-1}(N_h - 2)$ for units in stratum $h$, $z_{hi}$ is the
auxiliary variable vector $z_i$ in stratum $h$,


A — i ' a 
Cr Vian in ea ee) Ds 


$\bar y_h$ and $\bar z_h$ are stratum means of $y_{hi}$ and $z_{hi}$, respectively,
and $H = 10$ is the number of strata. The derivation of (11)
follows the same approach as the one in the appendix and has
been omitted.

For rejective sampling, the same variance estimators (10)
and (11), using the initial design inclusion probabilities, were
used to estimate the variance of $\bar y_{reg}$ for rejective
samples. Fuller (2009a) proved that the large sample properties
of the regression estimator for the rejective sample are
the same as those of the regression estimator for the original
selection procedure under some regularity conditions. For
cube sampling, a variance estimator proposed by Deville
and Tillé (2005) was evaluated for $\bar y_{reg}$ using cube samples.

Let $p(\cdot)$ denote the initial design and $\pi(\cdot)$ be the
resulting scheme after balancing. The number of samples
selected was 30,000 for each Monte Carlo simulation under
initial designs, cube sampling and rejective sampling with
both 90% and 95% rejection rates. The Horvitz-Thompson
estimator $\bar y_{HT}$ and the regression estimator $\bar y_{reg}$ were
constructed using initial inclusion probabilities $p_i$. Note
that for rejection sampling, the Horvitz-Thompson estimator
using the initial inclusion probabilities is not the Horvitz-Thompson
estimator under the balanced designs. For each
initial design, the following quantities were computed in the
simulation studies.


- $V_p(\bar y_{HT})$ (or $V_p(\bar y_{reg})$): Monte Carlo variance of the
Horvitz-Thompson estimator (or the regression estimator)
using samples from initial designs.

- $V_\pi(\bar y_{HT})$ (or $V_\pi(\bar y_{reg})$): Monte Carlo variance of the
Horvitz-Thompson estimator (or the regression estimator)
for balanced samples.

- $\mathrm{bias}_\pi(\bar y_{HT})$ (or $\mathrm{bias}_\pi(\bar y_{reg})$): Monte Carlo bias of the
Horvitz-Thompson estimator (or the regression estimator)
using balanced samples.
For cube samples,


- $\hat V_{DT}(\bar y_{reg})$: estimated variance of the regression
estimator using the variance estimators in Deville and
Tillé (2005) and each cube sample.

- ave$(\hat V_{DT}(\bar y_{reg}))$: Monte Carlo average of $\hat V_{DT}(\bar y_{reg})$
using all cube samples.


Deville and Tillé (2005) recommend several variance
estimators based on a Poisson sampling approximation with
corrections for known constraints in the design variance.
The first three estimators in Deville and Tillé (2005) have
minor differences, therefore only the second estimator was
used in the simulation studies. Deville and Tillé (2005) also
propose a fourth estimator, but that estimator requires
solving a nonlinear equation system, which would have
been computationally expensive to add to the simulation.
However, the fourth estimator could perform better than the
others for stratified designs, since it
reproduces the variance of a stratified random sample when
the balancing vector contains stratum indicators.


For rejective samples,


- $\hat V(\bar y_{reg})$: estimated variance of the regression estimator
using equation (10) (or (11)) for the Poisson (or two-per-stratum
stratified) initial design and each balanced
sample.

- ave$(\hat V(\bar y_{reg}))$: Monte Carlo average of $\hat V(\bar y_{reg})$ using
all balanced samples.


In the simulations, $\hat V(\bar y_{reg})$ was also computed for cube
samples, for comparison.

Table 1 reports the estimates for the Poisson design. The
variance of the Horvitz-Thompson mean under initial
Poisson sampling with expected sample size 20 and no
balancing is $V_p(\bar y_{HT}) = 0.08$. The variances in Table 1 are
standardized by $V_p(\bar y_{HT})$, and the biases are standardized
by $\sqrt{V_p(\bar y_{HT})}$. The Horvitz-Thompson estimator is
unbiased under the cube method designs, because cube
sampling retains the first-order inclusion probabilities. The
Horvitz-Thompson estimator using initial design inclusion
probabilities is biased under rejective sampling since the
inclusion probabilities differ from the initial design inclusion
probabilities, as indicated in Figure 1. The bias of the
regression estimator under rejective sampling is less than the
bias of the Horvitz-Thompson estimator with initial design
inclusion probabilities. The bias of $\bar y_{reg}$ under both cube
and rejective procedures is of the same order. Increasing the
rejection rate increases the bias of $\bar y_{reg}$ for the rejection
designs. However, the biases in $\bar y_{reg}$ under both balancing
procedures and rejection rates are negligible relative to the
Monte Carlo variances. For the Horvitz-Thompson estimator
using initial design inclusion probabilities, the gain
from using the balanced sample is substantial for both cube
and rejective methods. The mean squared errors are further
reduced by using the regression estimator along with either
balancing procedure. The gain from using the regression
estimator is larger for rejective sampling than for cube
sampling, likely due to the cube method achieving tighter
balance than the rejective method. Both procedures lead to
similar variances for the regression estimator. The variance
of the regression estimator under the Poisson initial design
is $V_p(\bar y_{reg}) = 0.249$ (relative to $V_p(\bar y_{HT})$). By comparing
0.249 to the fourth row of Table 1, we can see that the gain
from using the balanced samples on the regression estimator
is moderate. The result is consistent with the finding in
Fuller (2009a) that the variance reduction in $\bar y_{reg}$ by using
rejective samples is due to a second order correction. The
variance estimator of $\bar y_{reg}$ using (10) has small bias for both
cube and rejective samples (ave$(\hat V(\bar y_{reg}))$ in Table 1). The
variance estimator $\hat V_{DT}(\bar y_{reg})$ proposed in Deville and Tillé
(2005) performed similarly to $\hat V(\bar y_{reg})$ in (10) since the
second variance estimator in Deville and Tillé (2005) is very
close to (10) for Poisson sampling. This result supports the
claim that the Poisson approximation assumption in the
variance estimators of Deville and Tillé (2005) is satisfied
for the Poisson design case.


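Table 1
Properties of samples based on Poisson sampling of expected
size 20. $V_p(\bar y_{HT}) = 0.08$ and $V_p(\bar y_{reg})/V_p(\bar y_{HT}) = 0.249$


                                                                  Cube     Rej. 90%   Rej. 95%

$\mathrm{bias}_\pi(\bar y_{HT})/\sqrt{V_p(\bar y_{HT})}$         -0.002    -0.016     -0.007
$\mathrm{bias}_\pi(\bar y_{reg})/\sqrt{V_p(\bar y_{HT})}$        -0.002     0.002      0.005
$V_\pi(\bar y_{HT})/V_p(\bar y_{HT})$                             0.142     0.270      0.220
$V_\pi(\bar y_{reg})/V_p(\bar y_{HT})$                            0.131     0.136      0.129
ave$(\hat V(\bar y_{reg}))/V_p(\bar y_{HT})$                      0.122     0.123      0.121
ave$(\hat V_{DT}(\bar y_{reg}))/V_p(\bar y_{HT})$                 0.120     -          -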
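
The standardization used in the tables (biases divided by $\sqrt{V_p(\bar y_{HT})}$ and variances divided by $V_p(\bar y_{HT})$) can be expressed as a short helper. The function name `standardized_summary` and the toy inputs are hypothetical and serve only to show the arithmetic; they are not the simulation output reported in Table 1.

```python
import numpy as np

def standardized_summary(ht_initial, est_balanced, true_mean):
    """Standardize Monte Carlo bias and variance by the initial-design HT variance."""
    v_p = np.var(ht_initial, ddof=1)            # Monte Carlo V_p of the HT estimator
    bias = np.mean(est_balanced) - true_mean    # Monte Carlo bias under balancing
    v_pi = np.var(est_balanced, ddof=1)         # Monte Carlo V_pi of the estimator
    return {"std_bias": bias / np.sqrt(v_p), "rel_var": v_pi / v_p}

# Toy inputs only: two initial-design estimates and three balanced-design estimates.
out = standardized_summary(np.array([1.0, 3.0]), np.array([2.0, 2.5, 3.0]), 2.0)
```

In a real study `ht_initial` would hold the Horvitz-Thompson estimates from all replicates under the initial design, and `est_balanced` the Horvitz-Thompson or regression estimates from the balanced samples.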


In Table 2, estimates under the initial two-per-stratum
stratification design are reported. The variance of the
Horvitz-Thompson mean under the initial stratification
design is $V_p(\bar y_{HT}) = 0.011$ and all estimates are standardized
by this value. Since stratification in this initial design
controls for most of the effect of $x$ on $y$, the regression
estimator is not a large improvement over the Horvitz-Thompson
estimator using initial design inclusion probabilities.
The bias and variance of $\bar y_{HT}$ are close to those of
$\bar y_{reg}$ under both cube and rejective methods. The larger
estimated bias in $\bar y_{HT}$ under cube sampling is due to Monte
Carlo error. The gain from balancing on $x$ is not large,
compared to the gain in the Poisson example. However,
with this highly controlled initial stratified design, in which
the initial samples are already partially balanced on $x$, there
still can be a modest benefit from additional balancing and
using $\bar y_{reg}$ estimators. This result is seen for $\bar y_{reg}$ by
comparing the fourth row of Table 2 to the variance of $\bar y_{reg}$
under the initial design, $V_p(\bar y_{reg}) = 0.987$ (relative to
$V_p(\bar y_{HT})$). Therefore, in this
case a good strategy is to combine stratification, balancing,
and regression, which is similar to the conclusion drawn in
Deville and Tillé (2004). The variance estimator $\hat V(\bar y_{reg})$
using (11) gives estimates that are, on average, close to the
true regression estimator variances under both cube and
rejective procedures. However, the
variance estimator $\hat V_{DT}(\bar y_{reg})$ proposed by Deville and Tillé
(2005) performed poorly for cube sampling. A possible
reason is that the Poisson sampling approximation in the
second variance estimator of Deville and Tillé (2005)
assumes joint inclusion probabilities that are far from the
actual joint inclusion probabilities in the small strata. The
joint inclusion probabilities in the small strata are closer to
those of stratified random sampling than Poisson sampling.
This issue might explain why $\hat V(\bar y_{reg})$ in (11) using the
initial two-per-stratum inclusion probabilities is less biased
than $\hat V_{DT}(\bar y_{reg})$ in this case.


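Table 2
Properties of samples based on stratified sampling of size 20.
$V_p(\bar y_{HT}) = 0.011$ and $V_p(\bar y_{reg})/V_p(\bar y_{HT}) = 0.987$


                                                                  Cube     Rej. 90%   Rej. 95%

$\mathrm{bias}_\pi(\bar y_{HT})/\sqrt{V_p(\bar y_{HT})}$         -0.028     0.014      0.010
$\mathrm{bias}_\pi(\bar y_{reg})/\sqrt{V_p(\bar y_{HT})}$        -0.013     0.014      0.010
$V_\pi(\bar y_{HT})/V_p(\bar y_{HT})$                             0.910     0.866      0.813
$V_\pi(\bar y_{reg})/V_p(\bar y_{HT})$                            0.929     0.865      0.813
ave$(\hat V(\bar y_{reg}))/V_p(\bar y_{HT})$                      0.907     0.881      0.775
ave$(\hat V_{DT}(\bar y_{reg}))/V_p(\bar y_{HT})$                 0.792     -          -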


To assess large sample properties of the balancing
procedures, the size of the Poisson simulation was
quadrupled. The population was replicated four times and a
sample of expected size 80 was selected. The Horvitz-Thompson
variance of a mean under the Poisson design is
$V_p(\bar y_{HT}) = 0.020$ and the regression estimator variance is
$V_p(\bar y_{reg}) = 0.132$ (relative to $V_p(\bar y_{HT})$). The resulting relative variances and biases
are close to the results for samples of size 20 (Table 3). The
simulation results agree with the theoretical result of Fuller
(2009a) that the regression estimator is an $O_p(n^{-1/2})$
estimator after rejection of the type used in this paper.
Although it has not been proven here, the regression estimator
after cube sampling appears to possess similar properties
to the regression estimator using rejection sampling.
Table 3
Properties of samples based on Poisson sampling of expected
size 80. $V_p(\bar y_{HT}) = 0.02$ and $V_p(\bar y_{reg})/V_p(\bar y_{HT}) = 0.132$


                                                                  Cube     Rej. 90%   Rej. 95%

$\mathrm{bias}_\pi(\bar y_{HT})/\sqrt{V_p(\bar y_{HT})}$          0.002    -0.006     -0.007
$\mathrm{bias}_\pi(\bar y_{reg})/\sqrt{V_p(\bar y_{HT})}$         0.002     0.000     -0.001
$V_\pi(\bar y_{HT})/V_p(\bar y_{HT})$                             0.127     0.267      0.224
ave$(\hat V(\bar y_{reg}))/V_p(\bar y_{HT})$                      0.121     0.121      0.121
ave$(\hat V_{DT}(\bar y_{reg}))/V_p(\bar y_{HT})$                 0.121     -          -


5. Adjustments to the rejection procedure 


Fuller’s rejection sampling procedure treats all balancing 
variables with the same importance. For a large number of 
balancing variables, exact balance on all variables cannot be 
expected and the approximation could be poor for some 
important variables. Therefore, a practitioner may want to 
have tighter balance on a subset of the balancing variables. 
As an example, a researcher may want to use Poisson 
sampling for simplicity but also have some control on the 
random sample size. A random sample size can complicate 
study planning and is a large contributor to the variance of 
estimators. Balanced sampling can be used to reduce the 
variation in sample sizes by balancing on the variable $p_i$,
which is the initial first-order inclusion probability. For
Fuller’s rejection procedure, the variance of the sample size
increases when the number of balancing variables increases
and the rejection rate is held constant. The rejection
procedure can be altered so that the $p_i$ balance is tighter
than the balance for other variables.

One approach to increasing the balancing on a subset of 
variables is to change the rejection test function. The order 
of the approximation to the first and second-order inclusion 
probabilities in Fuller (2009a) remains the same when the 
variance matrix in the rejection quadratic form is replaced 
with a symmetric positive definite matrix of the same order. 

To determine weights for weighted rejection sampling, it
is convenient to transform the balancing variables so that
$V(\bar z_{HT} \mid F_N)$ is a diagonal matrix. The weighted rejection
sampling test statistic is

$$\sum_{q=1}^{m} c_q \, [V(\bar z_{HT,q} \mid F_N)]^{-1} (\bar z_{HT,q} - \bar z_{N,q})^2 < \gamma \qquad (12)$$

where $m$ is the number of balancing variables, $z_q$ is the $q$-th
balancing variable, and $c_q$ are selected weights. The weight
on the first variable $z_{1i} = p_i$ can be set large relative to the
weights on other variables to reduce variation in sample size.
The transformation is the Gram-Schmidt transformation
using the design variances under the initial design. Balancing
is done on the transformed variables, but the first variable is
not transformed. The transformed variables have uncorrelated
Horvitz-Thompson estimators. Balancing on the transformed
variables will still balance the original variables since
each transformed variable is a residual from a regression
operation on preceding variables.
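
One way to carry out the orthogonalization numerically is sketched below. It assumes that the Gram-Schmidt step in the design-variance inner product can be obtained from a Cholesky factor of the design covariance matrix, rescaled to a unit lower triangular matrix so that the first variable is left unchanged. The function name `weighted_statistic`, the covariance matrix and the deviation vector are hypothetical illustration values.

```python
import numpy as np

def weighted_statistic(d, V, c):
    """Weighted rejection statistic after orthogonalizing the balancing variables.

    d : deviations (HT mean minus population mean) of the balancing variables
    V : design covariance matrix of the HT means under the initial design
    c : weights, one per transformed variable
    """
    C = np.linalg.cholesky(V)          # V = C C'
    scale = np.diag(C)
    L = C / scale                      # unit lower triangular, V = L diag(scale^2) L'
    u = np.linalg.solve(L, d)          # transformed deviations; u[0] equals d[0]
    return float(np.sum(c * u**2 / scale**2))

V = np.array([[2.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 1.5]])        # hypothetical covariance matrix
d = np.array([0.4, -0.1, 0.3])         # hypothetical deviations
unweighted = weighted_statistic(d, V, np.ones(3))
tighter = weighted_statistic(d, V, np.array([1.5, 1.0, 1.0]))
```

With all weights equal to one the statistic reduces to the quadratic form in (8), so the weights can be read as inflating the contribution of selected components relative to the unweighted test.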

Equation (12) parallels the penalty term of the
distance function underlying ridge calibration. See Rao and
Singh (1997), Beaumont and Bocci (2008), and Chambers
(1996). Specifically, selection of the $c_q$ weights is similar to
the problem of selecting appropriate costs in ridge calibration.
Thus, rejection sampling using (12) can be viewed
as incorporating ridge calibration at the design stage.

A second way to produce tighter balance on a subset of 
variables is to do rejection separately for subsets. A test 
statistic is produced for each subset and a sample must be 
accepted by all of the tests to be accepted. In the Poisson 
case, one test statistic may reject if the sample size is not 
within a specified tolerance of the expected sample size. 
This second approach requires some additional assumptions 
beyond those in Fuller (2009a), but a similar argument can 
be used to justify the procedure. 

To prove the convergence properties of the multiple test 
rejection procedure, it is convenient to consider two subsets 
of balancing variables and think of rejection being done 
sequentially on each subset. We call the two subset rejection 
procedure a two-step rejective sampling procedure. Suppose 
$z_i = (z_{1i}', z_{2i}')'$ is the balancing vector and the original design
is denoted as $p(\cdot)$. The procedure is as follows.


Step 1: Select a sample using $p(\cdot)$, rejecting samples
until the balancing condition (8) holds on the first subset $z_{1i}$,

$$Q_1 = (\bar z_{HT,1} - \bar z_{N,1})' \, [V(\bar z_{HT,1} \mid F_N)]^{-1} \, (\bar z_{HT,1} - \bar z_{N,1}) < \gamma_1.$$


Step 2: Use the accepted sample from Step 1 to check the
balancing condition (8) on the second subset $z_{2i}$,

$$Q_2 = (\bar z_{HT,2} - \bar z_{N,2})' \, [V(\bar z_{HT,2} \mid F_N)]^{-1} \, (\bar z_{HT,2} - \bar z_{N,2}) < \gamma_2.$$


Reject the sample if the condition is not satisfied and repeat
Step 1.
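
The two steps can be sketched in Python for a Poisson initial design with one variable in each subset, $z_{1i} = p_i$ (which controls the sample size) and $z_{2i} = x_i$. The function name `two_step_poisson`, the values of $\gamma_1$ and $\gamma_2$, and the use of an untransformed $x$ in the second step are simplifying assumptions made for illustration only.

```python
import numpy as np

def two_step_poisson(p, x, gamma1, gamma2, rng, max_tries=100_000):
    """Two-step rejective Poisson sampling: balance on p_i first, then on x."""
    N = len(p)
    w = (1 - p) / p                                # Poisson design-variance weights
    v1 = np.sum(w * p**2) / N**2                   # variance of the HT mean of p_i
    v2 = np.sum(w * x**2) / N**2                   # variance of the HT mean of x_i
    for tries in range(1, max_tries + 1):
        s = np.flatnonzero(rng.random(N) < p)      # Poisson sample under p(.)
        q1 = (len(s) / N - p.mean()) ** 2 / v1     # Step 1: sample-size balance
        if q1 >= gamma1:
            continue                               # rejected at Step 1, redraw
        q2 = (np.sum(x[s] / p[s]) / N - x.mean()) ** 2 / v2   # Step 2: balance on x
        if q2 < gamma2:
            return s, q1, q2, tries
    raise RuntimeError("no sample accepted")

rng = np.random.default_rng(5)
p = np.full(100, 0.2)
x = rng.standard_normal(100)                       # hypothetical auxiliary variable
s, q1, q2, tries = two_step_poisson(p, x, gamma1=0.4, gamma2=0.1, rng=rng)
```

With these hypothetical settings the first test accepts only sample sizes from 18 to 22, which mirrors the kind of sample size tolerance discussed for the Poisson case.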

In both weighted and two-step procedures, trial and error
is likely needed to choose the $\gamma$'s in practice. In the weighted
procedure, the quadratic form becomes a sum of multiples
of $\chi^2$ random variables, which makes selection of $\gamma$ more
difficult than in the unweighted case. We used moment
matching approximations to select $\gamma$'s that provide rejection
rates close to those desired, but then resorted to small
simulations to determine the rejection rate as a function of
$\gamma$. For the two-step procedure, we used a $\chi^2$ approximation
to select a $\gamma_1$ that gave approximately the desired rejection
rate at the first step, and used a second $\chi^2$ approximation to
select an initial $\gamma_2$ that gave approximately the desired
rejection rate at the second step. The second parameter $\gamma_2$
was adjusted in order to achieve the target overall rejection
rate. The choice of $\gamma$'s in the two-step procedure is subjective
because many combinations of $\gamma_1$ and $\gamma_2$ can
produce the same overall rate. In practice, a practitioner
likely will set a tight bound for the first variable subset and
loose bounds on the remaining balancing variables.
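
The simulation-based choice of $\gamma$ mentioned above can be sketched as follows: simulate the test statistic over many samples from the initial design and take the empirical quantile that leaves the target acceptance rate. The Poisson design, the two balancing variables, the weights and the seed are hypothetical choices, and the code is one possible calibration, not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(11)
N, reps, target_rejection = 100, 20_000, 0.95
p = np.full(N, 0.2)
x = rng.standard_normal(N)
Z = np.column_stack([p, x])                    # hypothetical balancing variables
c = np.array([1.5, 1.0])                       # hypothetical weights

# Design covariance of the HT means under Poisson sampling, then orthogonalize.
V = (Z * ((1 - p) / p)[:, None]).T @ Z / N**2
C = np.linalg.cholesky(V)
L = C / np.diag(C)                             # unit lower triangular factor

# Simulate the weighted statistic over many initial Poisson samples.
take = rng.random((reps, N)) < p               # reps x N inclusion indicators
d = (take / p) @ Z / N - Z.mean(axis=0)        # HT mean minus population mean
u = np.linalg.solve(L, d.T).T                  # transformed deviations
q = (c * u**2 / np.diag(C) ** 2).sum(axis=1)

# gamma is the empirical quantile that leaves the target acceptance rate.
gamma = np.quantile(q, 1 - target_rejection)
```

Repeating the last line over a grid of target rates gives the rejection rate as a function of $\gamma$, which is the relationship the small simulations are meant to recover.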

The large sample mean and variance of the regression 
estimator under the two-step rejective sample are the same 
as those of the regression estimator for the original design. 
Also, the usual estimator of variance under the original 
design for the regression estimator is appropriate for the 
two-step rejective sample. The proof of this statement is an 
extension of the proof in Fuller (2009a) and can be provided 
upon request. 

To examine some properties of the two procedures, the 
Monte Carlo simulations for the Poisson initial sample 
design were repeated with the variable $p_i$ separated from
the other three variables. The balancing vector was transformed
so that the variance matrix of the Horvitz-Thompson
total estimators was diagonal. For the weighting procedure,
the weight on the $p_i$ component of the quadratic form was
set to 1.5, the weights on the other components were set to
1, and $\gamma$ was set to 0.627. This weighting procedure
restricted the samples to those with sample sizes ranging
from 18 to 22. For the two-step procedure, any sample with
a sample size outside of the range from 18 to 22 was
rejected in the first step and then the quadratic form for the
remaining three variables was checked using a $\gamma$ of 0.63 for
the second step. Given the good performance of the variance
estimator $\hat V(\bar y_{reg})$ in (10), Table 4 only contains its Monte
Carlo average values ave$(\hat V(\bar y_{reg}))$.


Table 4
Properties of rejection samples with adjustments based on
Poisson sampling of expected size 20, and 95% rejection rate


                                                                  Weighted   Two-step

$\mathrm{bias}_\pi(\bar y_{HT})/\sqrt{V_p(\bar y_{HT})}$         -0.005     -0.014
$\mathrm{bias}_\pi(\bar y_{reg})/\sqrt{V_p(\bar y_{HT})}$         0.003      0.002
$V_\pi(\bar y_{HT})/V_p(\bar y_{HT})$                             0.210      0.217
$V_\pi(\bar y_{reg})/V_p(\bar y_{HT})$                            0.132      0.132
ave$(\hat V(\bar y_{reg}))/V_p(\bar y_{HT})$                      0.121      0.121
$V_\pi(n)$                                                        1.237      1.902

Results for expected sample size of 20 and a rejection 
rate near 95% were similar for the two adjustment 
procedures (Table 4). The Horvitz-Thompson estimator for 
the weighted procedure performed slightly better than the 
Horvitz-Thompson estimator for the two-step procedure. A 
reason for this discrepancy is that the weighted procedure
had much less variation in sample sizes ($V_\pi(n)$ in the last
row of Table 4). Additional simulations with larger expected
sample sizes gave similar relative variances. The regression
estimator performed at roughly the same efficiency for the
two procedures. The Horvitz-Thompson estimators using
the initial design inclusion probabilities for these adjustment
procedures performed slightly better than the Horvitz-Thompson
estimator for the rejection procedure that did not
place additional control on the sample size. 


6. Discussion 


Rejection sampling and cube sampling produce roughly 
equally performing regression estimators. Balancing pro- 
vides major gains when the initial design provides little 
control on the auxiliary values entering samples. A well 
stratified sample design provides many of the benefits of 
balancing on a continuous variable. However, further bal- 
ancing after stratification can still yield small mean squared 
error gains for regression estimators. Additionally, bal- 
ancing could be used to prevent negative weights produced 
by regression estimators (Fuller 2009a). 

For the simulations, the rejection rate was fixed at 90% 
for the larger population. When the population and sample 
sizes are increased, the rejection rate can be increased while 
still maintaining a large set of possible samples. Additional 
simulations were carried out with rejection rates near 99%, 
but the results were not presented since the differences 
between the results with 95% and with 99% were very small 
and the bias of $\bar{y}_{reg}$ remained negligible. The marginal 
variance reduction due to balancing decreases as the 
balancing condition is tightened. 

In some special cases, an investigator may want to 
balance tightly on some variables and weakly on others. 
Gains can be made by choosing different weights for 
different variables or by dividing the variables into separate 
test sets. The weighted and two-step rejection procedures 
performed comparably, so the decision between procedures 
will largely be based on the ease of implementation. 
Appendix 
Start with 
$V(\bar{y}_{reg} \mid \mathcal{F}_N) = V(\bar{y}_{reg} - \bar{y}_N \mid \mathcal{F}_N).$ 
Let 
$\bar{y}_N = \bar{z}_N' \beta_N$ 
and note 
$y_i = z_i' \beta_N + e_{N,i}.$ 

term. Then the expansion of (13) is 
B = Nk Mea az 0, pe, + ONE?» 
icA 
For construction of confidence intervals for $\bar{y}_N$, it is enough 
to consider the variance of the linearized term. Therefore 
consider, in the notation of Särndal, Swensson, and Wretman 
(1992), 
where 
b, = 2,9; D)' &- 

The variance of the HT estimator for the mean of $b_i$ under 
Poisson sampling is 

$N^{-2} \sum_{i \in U} (1 - p_i) p_i^{-1} b_i^2.$ 

Next apply that $\check{\Delta}_{ii} = 1 - p_i$ to obtain the asymptotic variance 
approximation to the linearized part of $\bar{y}_{reg}$. 

The variance estimator is obtained by replacing the population 
totals with HT estimators under Poisson sampling and 
multiplying the result by a degrees-of-freedom correction of 
n/(n − s) because of the small sample size. 
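
The Poisson-sampling variance used in this appendix has a simple computational form, sketched below. This is an illustration only: the function names are hypothetical, the values are treated as scalars, and the degrees-of-freedom correction mentioned above is deliberately left out because its exact form is not recoverable from this excerpt.

```python
def ht_mean_variance_poisson(values, probs):
    """Design variance of the Horvitz-Thompson mean under Poisson sampling:
    N^{-2} * sum over the population of (1 - p_i) / p_i * b_i^2."""
    n_pop = len(values)
    return sum((1.0 - p) / p * b * b for b, p in zip(values, probs)) / n_pop ** 2


def ht_mean_variance_estimate(sample_values, sample_probs, n_pop):
    """Plug-in estimator: the population total is replaced by its HT estimator,
    so each sampled term is divided by p_i once more. No small-sample
    correction is applied here."""
    terms = ((1.0 - p) / (p * p) * b * b for b, p in zip(sample_values, sample_probs))
    return sum(terms) / n_pop ** 2
```

With values 1, 2, 3 and inclusion probabilities of 0.5 each, the first function gives (1 + 4 + 9) / 9; when every probability equals 1 the sample is a census and the variance is zero.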
The multidimensional integral business survey response model 


Mojca Bavdaz 


Abstract 


Knowledge of the causes of measurement errors in business surveys is limited, even though such errors may compromise 
the accuracy of the micro data and economic indicators derived from them. This article, based on an empirical study with a 
focus from the business perspective, presents new research findings on the response process in business surveys. It proposes 
the Multidimensional Integral Business Survey Response (MIBSR) model as a tool for investigating the response process 
and explaining its outcomes, and as the foundation of any strategy dedicated to reducing and preventing measurement 
errors. 


Key Words: Accuracy; Data collection; Economic statistics; Business survey; Measurement error. 


1. Introduction 


Measurement errors represent the gap between an ideal 
measurement and the obtained survey response (Groves, 
Fowler, Couper, Lepkowski, Singer and Tourangeau 2004). 
To efficiently prevent or reduce the occurrence of measure- 
ment errors, it is necessary to know how the process of 
responding to survey questions evolves and what influences 
its course. Because work to reduce errors in business 
surveys has traditionally focused on sampling, frame, and 
nonresponse errors and, to a lesser extent, on measurement 
errors (Willimack, Lyberg, Martin, Japec and Whitridge 
2004), knowledge of measurement errors and the underlying 
causal mechanisms is still largely limited in business 
surveys. This article attempts to fill that gap. 

Most studies that examine the causes of measurement 
errors in business surveys are a product of pretesting 
research. As a result, most such studies are hypothetical (e.g., 
Morrison, Stettler and Anderson 2002) or tentative (e.g., 
Phipps, Butani and Chun 1995) as opposed to being based on 
actual data collection (e.g., Hak, Willimack and Anderson 
2003). The abundance of pretesting results, which are usually 
bound to a particular survey, contrasts with the scarcity of 
quality assessment research (e.g., Giesen and Hak 2005) and 
with the shortage of generalization and linkages to the 
response process. Many studies focus on a particular aspect 
of the response process. For instance, Ponikowski and Meily 
(1989) examined the availability of data that business 
surveys require; Ramirez (1996) investigated respondent 
selection in business surveys; Jenkins and Dillman (1997) 
considered the design of business questionnaires; O’Brien 
(2000) and Willimack (2007) explored the respondent’s role 
in the establishment survey response; Greenia, Lane and 
Willimack (2001) concentrated on business perceptions of 
confidentiality and on the closely connected issue of data 
sharing among statistical organizations; and Willimack 
(2003) exposed comprehension issues. Recently, more 
attention has been dedicated to the development and testing 
of electronic business questionnaires (e.g., Snijkers, Onat and 
Visschers 2007) and their editing (e.g., Nichols, Murphy, 
Anderson, Willimack and Sigman 2005), while more 
frequent complaints about the costs that statistical reporting 
imposes on the business community have triggered research 
on the response burden (e.g., Hedlin, Dale, Haraldsen and 
Jones 2005). 

The first systematic treatment of the entire 
response process in establishment surveys was the general 
model of the survey response process for factual 
information presented by Edwards and Cantor (1991). 
Biemer and Fecso (1995) combined Edwards and Cantor’s 
(1991) cognitive model of survey response with a 
statistical model that tried to quantify measurement errors 
by their sources. Another attempt to grasp the entire 
response process in business surveys was made in 1998- 
1999, when the U.S. Census Bureau conducted unstructured 
qualitative interviews on statistical reporting. The study 
served as a basis for two business survey response models: 
the hybrid response model for establishment surveys by 
Sudman, Willimack, Nichols and Mesenbourg (2000) and 
the complete model by Willimack and Nichols (2001). Most 
recently, Lorenc (2006) suggested examining the entire 
response process on the basis of the idea of socially 
distributed cognition and using an establishment as a unit of 
observation. 

These models identify many essential aspects of the 
response process in business surveys and offer some 
concepts for them, but they treat many issues only partially. 
This was an incentive for a comprehensive study of the 
response process of a selected business survey, which made 
possible the further development of the business survey 
response model. This article presents the Multidimensional 
Integral Business Survey Response (MIBSR) model and 
discusses its contributions. 
2. Empirical study 


The aim of the empirical study was to build a conceptual 
framework of the response process — a response model — by 
examining from start to finish the actual response process to 
a typical business survey in a real business environment. 
The qualitative research interview was the primary method 
of investigation. The method was implemented using 
various techniques (mainly retrospective probing and 
ethnographic interviewing but also thinking aloud), two 
modes (in person and by telephone), and different inter- 
viewees (people from the participating business, question- 
naire administration experts from the statistical organization, 
and subject-matter experts). In some cases on-site observa- 
tion and analyses of micro data complemented those 
techniques. Considering all the variables, a range of ap- 
proaches had to be developed (for more details, see Bavdaz 
2009). On-site visits were arranged around two consecutive 
deadlines for the questionnaire’s completion in 2005. An 
attempt was made to contact all key people involved in the 
response process. 

The selected survey — the Quarterly Survey on Trade — 
was a business survey conducted by the Statistical Office of 
the Republic of Slovenia on a sample of approximately 
1,600 legal units performing trade activities. It had classic 
characteristics of business surveys: a recurring mandatory 
governmental mail survey. Its instrument was an eight-page 
paper questionnaire and instruction and classification 
booklets. The questionnaire consisted of an introductory text 
and four sections, one referring to the business as a whole 
and the other three each referring to one kind of trade 
activity (commission trade, wholesale, and retail). All 
sections asked for sales and employment data. In addition, 
there were questions on sales breakdowns, stock, activity 
codes, and size and number of stores. Nonresponding units 
received up to three reminders and, ultimately, a telephone 
call. The final response rates were generally high, greater 
than 90%. Major deviations and inconsistencies discovered 
during editing procedures also required telephone calls to 
businesses. 

The final sample in this study consisted of 28 businesses 
required to complete the Quarterly Survey on Trade. 
Previous studies resulting in models of the response process 
applicable to business surveys were based on small samples 
as well: 24 establishments (Edwards and Cantor 1991), 30 
large multiunit companies (Sudman et al. 2000; Willimack 
and Nichols 2001), and 7 schools (Lorenc 2006). This is 
consistent with exploratory interview studies, which tend to 
have small sample sizes of “around 15 ± 10” (see Kvale 
1996, page 102). The selection of businesses aimed to cover 
the heterogeneity of response processes. Because business 
size can be defined as the single most important business 
characteristic that is assumed to influence or be related to 
the characteristics of the response process (e.g., O’Brien 
2000), businesses were selected from all size classes. 

Several measures boosted the validity of the research 
design. The businesses were selected from different size 
classes, including some of the largest ones in trade but also 
some from nontrade primary business activity. A few 
businesses refused to cooperate, mainly because of work 
overload. Nevertheless, caution is necessary when applying 
findings to nontrade and overworked businesses. The study 
included people with different roles in the response process. 
Substantial effort was made to obtain participation and 
organize visits during the time the respondents were 
completing the questionnaire or right afterward so as to 
minimize the loss of information from their memory. The 
short time lags that occurred in some cases did not seem to 
be so damaging for remembering a frequently repeated and 
well-documented process, given the advance announcement 
of the impending on-site visit. Interview questions directed 
respondents to report how they last filled out the 
questionnaire (e.g., when the books closed that month, how 
much time they spent, who signed the form and how fast), 
and respondents generally supported their reports by data 
from paper and electronic documentation they used to fill 
out the questionnaire. All this helped distinguish their last 
engagement from the usual one. 

The interview as the primary research method was in 
some cases combined with observation. The interviews 
were tape-recorded and transcribed. More repeating patterns 
emerged as the fieldwork progressed, though diminishing 
returns of each consecutive on-site visit were noted toward 
the end of the fieldwork. The findings from the on-site visits 
were compared with the observations of the survey staff and 
subject-matter experts, quantitative data (where available), 
and previously published research. Alternative explanations 
were considered. Last but not least, the selection of a typical 
business survey made the generalization to other business 
surveys more plausible. As Yin (2003) suggests, all steps in 
the research were carefully documented to establish a chain 
of evidence and ensure high reliability of findings. 


3. The MIBSR model 


3.1 Presentation of the model 


One of the main study results is the Multidimensional 
Integral Business Survey Response (MIBSR) model, which 
integrates previous research findings and new findings from 
my empirical study. The MIBSR model explicitly distin- 
guishes between processes occurring at the individual level 
and others taking place at the organizational level, which is 
the business level in this case (see Figure 1). The cognitive 
processes of comprehension, retrieval, judgment, and 
response occurring at the individual level are taken from 
Tourangeau’s (1984) response model. They reflect the 
mental processes of people involved in the survey response 
that relate to the actual answering of particular survey 
questions as compared to the processes that refer to the 
organization, information support, and authorization of such 
answering, which occur at the business level. Contrary to 
the typical situation of surveys of individuals, parts of the 
process, such as requesting data from another participant or 
retrieving data from business records, are visible through 
participants’ physical actions. By using the survey level, the 
MIBSR model also allows for the possibility of conceptu- 
alizing the response process over several implementations of 
a survey or over several surveys (indicated by the arrows in 
Figure 1). 


Figure 1 MIBSR model 


The survey response task may involve several business 
participants who can enter and exit the response process at 
various points in time; but for the sake of clarity and 
simplicity, they are all depicted together. Business partici- 
pants take part in organizational processes while going 
through their own cognitive processes; thus, they are a 
unifying link between processes at the individual and 
organizational levels. They may adopt one or more of the 
roles with a different influence on the response process, 
namely a gate-keeper (e.g., a receptionist, boundary-span- 
ning unit), an authority, a response coordinator, a data 
provider, or a respondent. Although Figure 1 presents 
participants from a single business organization, successful 
completion of the task may require either the participation 
of people who provide outsourced activities or communi- 
cation with survey staff. 

The response process is triggered when the survey 
instrument crosses the business’s boundaries. The MIBSR 
model addresses the business response to a survey request 
presupposing a positive decision about participation in the 
survey. The examination of this decision, potentially leading 
to nonresponse, goes beyond the scope of this article even 
though it represents a natural introduction into the response 
process and may influence its course. The model suggests 
the most typical sequence of processes, although in practice 
some may be left out, repeated, or occur in a different 
sequence. The following sections focus only on elaborated 
and newly added insights into the response process. 


3.2 Organizational level 


3.2.1 Organization of the survey response 


Participation in a survey generally entails some prepa- 
ratory activities due to work distribution and specialization 
in organizations. It requires an answer about who will 
perform the survey response task and when it will be done; 
both answers provide clues about how the task will be 
carried out. The study provided evidence that the two steps 
could be intrinsically linked. In fact, the selection of people 
for the survey response may itself indicate the priority 
assigned to the task in the organization. For instance, in 
some accounting firms and larger businesses, chiefs 
performed the task themselves although they could have 
delegated it, which may indicate that the task carried a certain 
importance; by contrast, the fact that many respondents received 
the task as novices may indicate its low priority. However, 
priorities at the individual level were not always consistent 
with priorities at the organizational level. For instance, even 
if tax reporting gained higher priority than statistical 
reporting at the organizational level, this was irrelevant for a 
survey respondent not involved in tax reporting. I therefore 
examined the selection of business participants and the 
scheduling of the survey response task together within the 
organization of the survey response. The result is an 
expanded list of factors potentially influencing the organi- 
zation of the survey response task (see Figure 2). 

Tradition, customary practices, established procedures, 
and information location mainly influence the selection of 
business participants, which is an organizational matter, 
while other factors operate at both the organizational and the 
individual levels. Tradition dictates reliance on previous 
participants in recurring surveys when the same people 
repeatedly participate in the response process of the same 
(longitudinal) survey. Some study respondents claimed they 
had been “filling it out for years.” Some had been filling it 
out since they started the job or since a colleague retired, 
went on a longer sick leave, left the job, and so on. 


Statistics Canada, Catalogue No. 12-001-X 




Figure 2 Factors influencing the organization of survey response 
(Diagram labels: tradition; customary practices and established procedures; information location; competing tasks; attitudes to survey response task; record formation; data delivery. The legend distinguishes influence at the organizational level from influence at both the organizational and individual levels.) 


Many processes in organizations draw on customary 
practices and established procedures, which leads to the 
selection of the usual participants. This means that even 
when a new survey request reaches the business, the 
business will likely proceed in the same way as with 
previous survey requests because of the relatively stable 
distribution of work. In fact, some of the respondents in this 
study explained that the survey questionnaire would often 
be directed to the same department or person, who usually 
replied to such requests even if no formal policy on surveys 
existed. As one respondent clarified, “They prefer to bring 
them to me; this is the only policy.” Some respondents 
knew which types of surveys they received, saying, for 
instance, “I’m doing all statistics except wages,” or “I’m 
doing all statistics, also for the Bank of Slovenia, except 
Intrastat.” Even in larger businesses, the same person often 
filled out several different survey questionnaires; one person 
completed all survey questionnaires that required financial 
data, be it for the Bank of Slovenia, the Statistical Office, or 
the Agency for Public Legal Records; others provided a list 
of specific surveys that they would complete, such as 
surveys on investments, fixed assets, value added, and so 
on. 

Information location is an essential factor that influences 
the selection of business participants from the perspective of 
measurement errors. It refers to sufficient knowledge to 
provide an accurate survey response, including adequate 
access to records, if necessary. In this study, many 
respondents expressed that they had been chosen because of 
their access to data, for instance, “I have the data and I know 
how to retrieve them.” 

Competing tasks relates to the assignment of people to 
tasks and the ordering of those tasks. It usually influences the choice of 
business participants at the organizational level when 
alternative possible participants are compared, as well as the 
scheduling of the survey response task at the individual 
level when the priorities of a participant’s several tasks are 
considered. Study respondents in several, mainly smaller 
businesses agreed that they give low priority to the survey 
response task when they schedule their work: “VAT (value- 
added tax), debt recovery, bookkeeping . . . all has priority 
over statistics.” Another respondent said that she “wouldn’t 
think of doing the survey on the day all the book entries are 
done” but instead checks “the balance sheet, . . . liabilities, 
how the payments stand, how much debt there is, the 
financial situation.” Another explained the work process as 
“internal reporting first, current affairs next, statistical 
reporting afterwards.” In a few larger businesses, however, 
respondents said that they completed survey questionnaires 
as soon as data became available or final. 

Similarly, attitudes to the survey response task can be 
examined at the organizational level through formal policies 
on surveys and the informal reactions of authorities as well 
as individual perceptions. Businesses in this study did not 
have any formal policies on surveys, though the discourse of 
authorities in some companies indicated their negative 
attitudes: “it’s only statistics; prepare something.” Organizational 
attitudes may affect the organization of the survey 
response through potential consequences for the business, 
particularly opportunity costs, penalties, and damage to the 
public image. Most participants expressed a negative 
attitude toward surveys, describing them as “a necessary 
evil” and “redundant” or “additional” work. Individual 
attitudes toward surveys may contribute to the early, timely, 
or late scheduling of the task; they may also influence an 
individual’s inclusion or exclusion in the survey response 
task. 
Record formation and data delivery are primary in the 
scheduling of the response tasks. The timing of record 
formation determines when the records with the required data 
about the business are created and take on an acceptable 
or desirable form, in particular when the data become final. 
Respondents in larger businesses and businesses with 
foreign ownership typically referred to internal deadlines for 
“closing the books” or the VAT submission deadline. Data 
delivery is relevant in those cases where the participant must 
rely on other people to deliver required data. This partic- 
ularly applied to accounting firms in this study. However, 
the timing of record formation and data delivery may vary 
by the kind of data requested, so that the latest record 
formation and the latest data delivery, eventually, determine 
the actual scheduling. For instance, some respondents 
explained that more time was necessary to get the correct 
value of stock because of lags in recording incoming 
invoices as compared to sales figures. 

After the organization of the survey response task, the 
task can be realized, though it is sometimes necessary to 
further refine the selection of business participants or the 
scheduling to provide for all requested items, absence from 
work, and other circumstances. 


3.2.2 Retrieval of information from the business 
information system 


The capacity of the business information system (BIS) is 
the key factor that influences the response process and its 
outcome in business surveys. The BIS does not consist of 
the technological element only; it also includes people 
(Avison and Elliot 2006). The human capacity of the BIS 
relevant for the business survey response is mainly reflected 
in cognitive processes at the individual level (see section 
3.3), while its technological capacity is determined through 
business records at the organizational level. The study 
showed that formation of business records depends on 
internal and external factors, though the line between the 
two groups is blurred (see Figure 3). 

External factors (legal obligations, standards, and 
benchmark practices) are imposed on companies from the 
environment and dictate the content of business records 
through cogency or the threat of sanctions. Legislation, 
regulations, and other instruments with the force of law set out 
legal obligations. With respect to that, study respondents 
mainly mentioned mandatory compliance with accounting 
standards and the requirements of tax authorities. The latter 
could refer to the business as a whole (e.g., VAT reports) or 
to particular items (e.g., excise duties on tobacco products). 
Other mandatory requirements may relate to contributions, 
securities, insurance, environmental issues, and so on. 
Participants usually noted the compulsory character of 
governmental business surveys, although the lack of 
sanctions for nonresponse or a late response made some 
participants question this; furthermore, changing record 
formation for statistical purposes only was unthinkable to 
most study participants. Standards are a softer form of 
external factors: they are not mandatory, but are expected to 
be followed in most cases. Two examples from the study 
include the use of a classification based on the European 
Article Number barcode standard and recommendations 
from accounting authorities. The study suggested that 
standards went unused only where specific reasons applied; for 
instance, the information systems of the smallest retailers 
did not support barcode use. Benchmark practices are the 
least influential group of external factors. They refer to good 
examples of practice that have gained some recognition and 
authority by reputation (and not by law or institutional 
power). For instance, some study respondents mentioned 
obsolete software versus current standards, while others 
stressed powerful capabilities of their software and its 
positive influence on data provision. 


Figure 3 Factors of record formation 
(Diagram labels: internal factors; external factors; characteristics of business activity; embeddedness; disposition; legal obligations; standards; benchmark practices; cogency / sanctions; management needs; business records.) 


External factors drive data homogeneity and com- 
parability in business records across companies, at least 
within similar economic activities. They provide the 
framework in which companies develop their own solutions 
for business records according to internal factors, unless 
adhering to compulsory requirements more than fully 
satisfies the data needs for running the business, as was the case 
in small, local companies. Internal factors of record 
formation include characteristics of business activity, such 
as the size, type, and diversity of the business activity; 
embeddedness in the business environment; and the 
disposition to forming records. 

The size of the business activity plays a crucial role in 
record formation because it determines how much of the 
activity can be overseen directly. In the study, most larger companies had an 
abundance of data. Business records provide information that 
cannot be gained from participation or observation alone. 
That said, the size of the business activity is relative, 
especially if the size is observed only within legal boundaries 
or national borders. Therefore, it is better to speak about the 
embeddedness in networks of various kinds. In the study, for 
instance, a couple of smaller businesses had a foreign owner 
that demanded comprehensive reports to overcome the 
distance and manage the business remotely, and another 
small business had to use the sophisticated software of a 
business partner because it was its major supplier. The study 
also showed how different types of activities influenced the 
kind of available records; for example, wholesale businesses 
that typically put recipients on their invoices had more 
information on their buyers than businesses in retail that 
typically issued receipts without indicating the name. High 
diversity of business activities also is a major challenge for 
record formation in most businesses; in general, smaller 
businesses had renounced the use of detailed records and 
were forced to make estimates instead. Last, disposition 
refers to the prevailing attitudes of people in the business to 
various aspects of record formation, such as the inclination 
toward data, information technology, and change. Some 
businesses relied heavily on evidence-based decision making 
and thought highly of data; others showed enthusiasm for the 
possibilities of information technology, but a few others saw 
no usefulness in data. 

Factors of record formation influence the availability of 
data in business records and their compliance with survey 
definitions. Data availability appears at the intersection of 
technological and human capacity in the business; 
knowledge is required to extract data from the BIS 
conditional on their existence. Several levels of answer 
availability in the BIS apply to survey questions (see 
Figure 4); their naming was inspired by the determination of 
cognitive states in Beatty and Herrmann (2002) and is in 
principle consistent with that proposed by Lorenc (2007): 


(a) A datum is accessible — the required answer may be 
readily available. In this study, a typical example is 
total sales revenue, which is readily available to a 
person in accounting, or the number of employees, 
which is readily available to a person in the 
personnel department. 

(b) A datum is generable—the required answer is not 
readily available to any person; the available data 
represent a basis for generating the required answer 
through manipulation. In the study, for instance, 
sales revenue in a particular trade activity was not 
always readily available, but it was possible to derive 
the exact figure by consulting two separate records 
(e.g., the general ledger and commercial records). 

(c) A datum is estimable —the required answer is not 
readily available to any person; the available data 
represent an approximation of the required answer or 
a basis for estimating the required answer through 
manipulation. In the study, a sales breakdown by 
commodity groups (e.g., food, beverages, clothes, 
footwear) was often estimated by recategorizing 
available groups; however, those categories were 
sometimes too aggregated or too diverse to allow for 
an exact match (e.g., Christmas products, Easter 
gifts, discontinued products). 

(d) A datum is inconceivable — no available data lead to 
the required answer or its approximation; some bases 
for generating or estimating the required answer exist 
but require an unimaginable effort to produce it. For 
instance, a company would have to classify more 
than ten thousand invoices monthly to arrive at an 
exact breakdown of sales by kind of buyers. 

(e) A datum is nonexistent—there are no bases for 
estimating the required answer. In the study, a cash- 
and-carry store could not distinguish between 
different kinds of buyers because they issued the 
same kind of nameless invoices to all customers, 
companies and individuals. 


Because data availability varies across people in a 
business, it may be useful to determine answer availability 
at the individual level. In this case, a distinction has to be 
made between an answer that someone can obtain directly 
and an answer that they can access only through another 
person. 


Figure 4 Levels of answer availability and likely response outcome 
(Diagram labels: levels of answer availability, including accessible, generable, and estimable; likely response outcomes, including exact datum, approximation, solid estimate, and rough estimate.) 

The final response outcome is conditional on the level of 
answer availability and may range from an exact datum to 
item nonresponse (see Figure 4). A measurement error 
occurs whenever the response outcome deviates from the 
exact datum. When a datum is accessible or generable, the 
response outcome is likely to be an exact datum, although 
the possibilities of committing a measurement error increase 
if data have to be accessed through other people or 
manipulated. When a datum is estimable, the response 
outcome may be an approximation with a negligible mea- 
surement error or an estimate with a minor or substantial 
measurement error. An inconceivable datum may, at best, 
lead to a rough estimate. When respondents have no 
adequate bases to provide a response, they may make wild 
guesses resulting in blunders or skip the question, which 
leads to item nonresponse. 
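
The relationship between availability levels and likely outcomes described above can be summarized as a simple lookup. The following Python sketch is illustrative only: the names `LIKELY_OUTCOMES` and `likely_outcomes` are hypothetical, and the mapping condenses the wording of this section rather than reproducing Figure 4 exactly.

```python
# Illustrative lookup of likely response outcomes by level of answer
# availability. The names are hypothetical; the mapping paraphrases the text.
LIKELY_OUTCOMES = {
    "accessible": ("exact datum",),
    "generable": ("exact datum",),
    "estimable": ("approximation", "estimate"),
    "inconceivable": ("rough estimate",),
    "nonexistent": ("blunder", "item nonresponse"),
}


def likely_outcomes(level: str) -> tuple:
    """Return the likely response outcomes for a level of answer availability."""
    key = level.strip().lower()
    if key not in LIKELY_OUTCOMES:
        raise ValueError(f"unknown level of answer availability: {level!r}")
    return LIKELY_OUTCOMES[key]


if __name__ == "__main__":
    for level in LIKELY_OUTCOMES:
        print(f"{level}: {', '.join(likely_outcomes(level))}")
```

The lookup deliberately ignores the caveat that accessible and generable data can still yield errors when they are obtained through other people or manipulated; it records only the most likely outcome at each level.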


3.2.3 Authorization of the business response 


Authorization is the final opportunity for corrective 
actions before the business response is forwarded to the 
survey organization and documentation archived. Most 
businesses in this study found this organizational step 
inconsequential and even skipped it. In more than half of 
businesses, respondents signed the questionnaire themselves 
because “they have the mandate to sign such things” and 
“the director is very rarely present” or “does not deal with 
such things.” Still, even in those cases, some respondents 
mentioned that the director had been informed about that 
procedure. In several businesses, the superior signed the 
questionnaire for the sake of formality and no verification 
procedures were in place because “the director trusts us” or 
“doesn’t have the necessary data,” or because “we work this 
way.”


A superior was typically present in the largest companies, 
through formal authorization or informal notification. 
Internal verification was rare, which could be the 
consequence of preceding consultations with the superior. 
Accounting firms usually delivered the completed question- 
naire to the business for signature, though businesses 
sometimes also signed the blank questionnaire in advance. 


3.3 Individual level 


Given the level of answer availability in the BIS, the final 
response outcome depends on the performance of cognitive 
processes and accompanying physical actions (especially 
interaction with computers) at the individual level. The 
MIBSR model proposes that three 
inherently linked types of knowledge are relevant for these 
processes: knowledge of business reality, knowledge of 
record formation, and knowledge of business records (see 
Figure 5). Although it may be difficult to disentangle the 
three types of knowledge in practice, the study seems to 
suggest that every type is particularly influential for one 
kind of cognitive process. 

The division of cognitive processes into comprehension, 
retrieval, judgment, and response derives from Tourangeau’s 
(1984) response model. In business surveys, these processes 
may not be defined as easily as in surveys of individuals 
because the initial organization of the response may involve 
either a brief and superficial consideration of the survey task, 
with barely any impact on the later response process, or a 
thoughtful reflection on the questions. The study mainly 
focused on respondents’ cognitive processes because it is 
their task to answer survey questions. Nevertheless, ob- 
servations of other business participants are provided where 
available. 

[Figure 5: diagram linking knowledge of business reality, knowledge of record formation, and knowledge of business records]

Figure 5 Knowledge relevant to the business survey response 




3.3.1 Comprehension 


In comprehension processes, respondents interpret the 
survey request for data, which usually is in the form of 
labels instead of questions. The MIBSR model suggests 
that, for comprehension processes, knowledge of business 
reality is particularly important. Business reality refers to the 
activities the business performs to subsist and to the division 
of work across locations and individuals. Knowledge of 
business reality thus presupposes acquaintance with every 
aspect of the business: who does what, what activities the 
business is involved in and how they are carried out, how 
decisions are made, why the business situation is as it is, 
how it evolved through time, and so on. Because larger 
businesses tend to be complex with technical and social 
divisions of labor, establishment of branches, organizational 
hierarchy, and decision-making structure (Tomaskovic- 
Devey, Leiter and Thompson 1994), it can be expected that 
fragmentation of the knowledge of the business’s reality 
increases with business size. 

This knowledge is essential in establishing whether 
survey questions are applicable to the business and 
providing correct answers afterward. In fact, no business in 
the study filled out all survey items. Respondents had to fill 
out only sections that applied to the kinds of trade they 
performed. Survey questions also required them to select 
applicable commodity groups, kinds of employment, kinds 
of buyers in wholesale, kinds of payment in retail, and so 
on. The required knowledge of business reality was 
occasionally specific: one respondent, for instance, needed 
information about the relationship between the company as 
the franchisor and their franchisees to avoid double counting 
or skipping some items across the businesses. 

A major obstacle to using knowledge of business reality 
for correctly understanding survey questions was the 
incomprehension of economic and accounting concepts or 
their confounding with other concepts. For instance, one 
respondent had problems distinguishing between the 
concept of trade, which includes repackaging of goods, and 
the concept of production, which entails some transformation 
of goods beyond repackaging; a few respondents pondered 
over trade rendered on a commission basis because their 
activity was trade but accounting treated it as a service; 
many respondents associated retail with a store rather than 
with individuals as final consumers, regardless of the kind 
of buyer; one respondent defined wholesale as “everything 
that is not paid with cash” instead of linking it to nonfinal 
consumption; some respondents did not understand that 
“nontrade and nonmanufacturing organizations” were 
service providers; others did not understand the difference 
between merchandise and material, because the latter is an 
input to production (not trade) in accounting terminology 
and takes on another meaning in colloquial language, such 
as construction or building material. 

Study respondents often used their own definitions to 
interpret survey questions. The same is true for those 
business participants who provided data on request without 
actually seeing the questionnaire and/or instruction booklet. 
This, for instance, happened in a few larger businesses 
where data providers completely relied on their own 
definitions of the sales space when providing data on store 
distribution by size of the sales space because additional 
explanations were given only in the instruction booklet. 


3.3.2 Retrieval 


In retrieval processes, the data and information required 
for the survey response are located and brought forth. In 
business surveys, the data usually reside in business records, 
not in memories, but knowledge is crucial for their 
extraction and interpretation. The retrieval thus mainly rests 
on knowledge of the business records, which refers to the 
contents and location of business records in the business and 
the possibilities of data access, including familiarity with 
applications and the people in charge of them. 

Study respondents mainly exhibited good knowledge of 
the business records they worked with. In a couple of 
businesses where superiors participated in the response 
process, the superiors were not abreast of all details of the 
records and had an assistant perform the retrieval—but they 
had excellent insight into the business reality and knew how 
it converted into records. Even perfect knowledge of the 
business records, however, did not always suffice for exact 
answers. When the business records did not register all 
necessary data, knowledge of the business reality became 
critical for making correct inferences and good estimates. 
This sometimes happened in larger businesses and 
accounting firms where respondents knew the records very 
well, including the chart of accounts and its codes, but knew 
the assortment of merchandise only vaguely. As a result, 
they had to use estimates when classifying sales by 
commodity groups because their acquaintance with the business 
activity could not match the comprehensive, firsthand 
insight of sales personnel. In smaller businesses, lack of 
necessary data in records sometimes meant complete 
reliance on memory instead of records; a respondent, for 
instance, arrived at employment in wholesale by retrieving 
the number of people in relevant workplaces, namely 
chauffeurs, people who worked in the warehouse, 
salespeople, and office clerks. 


3.3.3 Judgment 


Judgment refers to the compilation of all retrieved data 
and information to formulate an answer. In this study, it 
frequently entailed some data manipulation or handling, 
such as summation, balance with a residual, recategorization, 
and application of proportions. Judgment is mainly supported 
by knowledge of record formation. This knowledge provides 
information on how the business reality translates into 
business records and ensures that captured data are not 
considered isolated figures, codes, or words but take on a 
certain meaning representing the processes and objects 
measured. It therefore represents a link between knowledge 
of the business reality and knowledge of the business 
records (see Figure 5). Its importance was, for instance, 
noted during the observation of a respondent who was 
filling out the questionnaire and had to struggle with an 
inconsistency in the retrieved sales data. To identify the 
mistake, she systematically analyzed nonsales activities in 
the observed period and the correctness of their encoding in 
the records to finally discover a transaction that should not 
have been included in the sales figures. 

However, lack of knowledge could not explain some 
judgments with an unfavorable response outcome, so the 
study looked more closely into principles that guided judg- 
ment. Among the most pervasive principles encountered in 
the survey response process under study was the principle of 
continuity, which advocates the use of the same response 
strategy in recurring surveys — even if this leads to errors. 
Continuity was sometimes considered within a year but also 
across years. It seemed to be strengthened by the lack of 
negative feedback from the statistical organization and its 
presumed satisfaction with the data. The study identified 
several respondents who used detailed procedures of 
calculation that were quite obsolete. A respondent even 
erroneously left out the section of commission trade but 
would not change the procedure during the year to avoid 
disrupting the reported data. 

Two other principles were identified in relation to the 
principle of continuity: the principle of consistency and the 
principle of disregarding the exceptional. The principle of 
consistency implies use of the same or similar response 
strategies in the same survey questionnaire. For instance, a 
respondent who attributed various items of merchandise to 
only one commodity group in wholesale did the same in 
retail; a respondent who estimated wholesale turnover from 
VAT figures used the same approach to retail turnover, and 
so on. The principle of disregarding the exceptional implies 
ignoring new, one-off, or temporary activities. For instance, 
a study respondent inadvertently mentioned a temporary 
activity not reported in the questionnaire; another confessed 
the exclusion of new activities from reporting because their 
success was uncertain. The question, however, is how to set 
boundaries on the novelty and on the temporariness and 
when precisely such activities become representative of the 
business. 

The principle of disregarding the exceptional is also 
related to the principle of disregarding the marginal, which 
advises ignoring those activities that are perceived as 
marginal to the business. For instance, some study respon- 
dents disregarded some items in sales breakdowns if they 
represented less than one percent of activity. The impact of 
the principle depends on the use of the collected data. It 
should be inconsequential if the aim is to estimate national 
totals or change. However, sales of a specific commodity 
group may be marginal to a large business but not marginal 
for the market of that commodity group. 
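
A short worked example may make the last point concrete. In the Python sketch below, every figure is hypothetical and chosen only for illustration; the one-percent threshold is the rule of thumb that some study respondents applied, and the helper `share` is an assumed name.

```python
# All monetary figures are hypothetical and serve only as an illustration.
THRESHOLD = 0.01  # items under one percent of activity were disregarded


def share(part: float, whole: float) -> float:
    """Return part as a fraction of whole."""
    return part / whole


business_sales = 1_000_000_000  # total sales of a large business
commodity_sales = 5_000_000     # its sales in one commodity group
market_sales = 50_000_000       # national sales of that commodity group

share_of_business = share(commodity_sales, business_sales)
share_of_market = share(commodity_sales, market_sales)

# The respondent drops the item because it looks marginal to the business.
omitted_by_respondent = share_of_business < THRESHOLD

if __name__ == "__main__":
    print(f"Share of the business's sales: {share_of_business:.1%}")
    print(f"Share of the commodity market: {share_of_market:.1%}")
    print(f"Omitted under the one-percent rule: {omitted_by_respondent}")
```

Under these assumed numbers the item is half a percent of the business's sales, so it falls below the threshold, yet it accounts for a tenth of the market for that commodity group.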

The business perspective principle advocates the priority 
of the business perspective as compared to a statistical 
request. In the study, data on existing organizational units 
were judged acceptable despite their divergence from the 
required units; data on various packages (e.g., a newspaper 
supplemented with a book) that were relevant from the 
business perspective were not disentangled for statistical 
purposes. 


3.3.4 Response 


The response component refers to the processes of 
mapping a judgment onto a response category and editing 
the response (Tourangeau, Rips and Rasinski 2000). In 
business surveys, mapping usually translates into matching 
available data from the BIS with response categories 
offered, which provides room for a specific form of 
measurement error: misclassification. For instance, when 
respondents had problems fitting available sales data into 
the provided classification scheme, they often chose the 
closest category, the main category, or the category “other.” 

The study also identified the presence of editing 
processes that show different aspects of business sensitivity. 
Some study respondents checked whether their selection of 
the decisive activity code was consistent with their 
registered activity, which may show a fear of nonconformity 
with administrative requirements. Not reporting people who 
helped in family businesses may reveal tax evasion. 
Although many respondents agreed that the data they 
reported in the questionnaire were considered confidential, 
there was little evidence of reluctance to disclose the 
data to the statistical organization (e.g., not reporting 
detailed data on newly introduced activities). 


3.4 Survey level 


The MIBSR model introduces the possibility of concep- 
tualizing the response process over several implementations 
of a survey or over several surveys. It thus conceptually 
enables the observation of how the elements of survey 
design, which is under the control of the survey organization, 
influence the response process. 

The study focused on the impact of recurrence on the 
response process. In repeated administrations of the survey 
to the same business, the organization of the survey re- 
sponse became less relevant or irrelevant if it was a perfect 
replica of the preceding administration. The cognitive 
processes at the individual level were characterized by 
routine when the same business participants performed 
them. Many respondents admitted that they had not read the 
whole questionnaire, let alone the instructions in a repeat 
questionnaire. This also occurred in businesses that agreed 
to be observed while completing the questionnaire: after 
respondents gave the questionnaire a swift scan for any 
changes, they plunged into the retrieval processes based on 
the previously completed questionnaire or on other 
documentation and supporting notes. The comprehension 
step was thus performed superficially and pertained more to 
understanding completion of the previous questionnaire than 
it did to understanding survey requests. The retrieval 
procedures followed the previously established course and 
exhibited learning-curve effects. The respondent’s judgment 
clung to the initial approach and was unlikely to change. 
The recurrence frequently loosened the supervision of 
respondents and reduced the importance of the authorization 
or even led to its omission. 

Because the survey task was assigned to the same people or 
the usual units in the business, many of them sooner 
or later had contact with survey staff, despite the common 
self-administrative mode of data collection in business 
surveys. Such contact could occur early in the response 
process and influence the respondent’s comprehension and 
judgment. This was rarely the case in the study; only a few 
respondents asked for explanations the first time they 
participated in the survey and another respondent asked for 
help when the business’s activity changed. Contacts in 
which respondents requested postponement of the deadline 
did not seem to influence the subsequent response process, 
though the same could not be claimed for respondents who 
resisted participation. All other contacts happened during a 
follow-up when the response process, or parts thereof, had 
to be performed again, which could result in an adjusted 
survey response. Although respondents mainly acknowl- 
edged the politeness of the survey staff, their calls signaled 
that something was wrong: a missed deadline, an item 
missing in the questionnaire, an inconsistency in the 
reported data. The rareness of such contacts made a 
significant impression on respondents because these 
contacts were often the only type of feedback from the 
statistical organization. 

In contrast, respondents did not always appreciate a lack 
of feedback. They expected feedback from the statistical 
organization after they first participated in the survey, but 
this generally did not happen. The lack of reaction made 
them confident in their approach, thus reinforcing the 
principle of continuity in their judgment. However, many 
respondents reported at least one piece of data that was not 
completely accurate (or not as accurate as they would expect 
the data should be) and they perceived the lack of 
complaints as satisfaction with bad data. Some respondents 
were convinced that the statistical organization knew about 
their business activity, which is why they rarely provided 
textual descriptions of seasonal oscillations. Given these 
observations, it is not surprising that several respondents 
expressed doubts about the accuracy of statistical data or 
questioned the accuracy of data that others provided. The 
right feedback may not only be important for that particular 
survey but also for participation in other surveys because it 
contributes to general perceptions on surveys and statistics. 


4. Discussion of model’s contributions 


The dominance of written communication between the 
survey organization and businesses has moved business 
participants away from the center of statistical production 
and reduced the possibilities of insights into the process of 
responding to survey requests and the causes of measure- 
ment errors. By studying the response mechanisms and 
influencing factors, response models help bring these 
insights out and design approaches that turn this knowledge 
into an advantage. This section discusses the contributions 
of the MIBSR model with respect to previous response 
models applicable to business surveys. 


4.1 Model construction 


Two approaches were encountered in construction of 
previous models: adding some organizational steps to the 
core cognitive processes from Tourangeau’s cognitive 
model of survey response (Biemer and Fecso 1995; 
Edwards and Cantor 1991; Sudman et al. 2000; Willimack 
and Nichols 2001) or using the organization as the unit of 
observation (Lorenc 2006). The MIBSR model explicitly 
links the processes to the level at which they occur: cog- 
nitive processes to the individual level and organizational 
processes to the organizational (in our case, the business) 
level. It also foresees the observation of the response 
process over several implementations of the same survey or 
over several surveys with different designs, which is 
particularly interesting for governmental surveys. By 
analyzing complex response processes at the appropriate 
level of observation, the MIBSR model sets up a framework 
that can also be used for quantitative modeling and 
experimental design. 


4.2 Insights at the organizational level 


Previous models treated initial organizational arrange- 
ments in the context of respondent selection (Biemer and 
Fecso 1995; Edwards and Cantor 1991) or in separate steps 
of respondent selection and the assessment of priorities, the 
latter ranking statistical reporting to the government lower 
than most other business reporting activities (Sudman et al. 
2000; Willimack and Nichols 2001). They also identified 
several factors that influence respondent selection, espe- 
cially the functional role, authority level, and position with 
regard to the information system (Edwards and Cantor 
1991), knowledge of the information system, terms and 
definitions (Biemer and Fecso 1995), competing job responsibilities 
and access to the data (Sudman et al. 2000). The 
MIBSR model integrates all preparatory activities in the 
organization of survey response and suggests an expanded 
list of influencing factors. The organization of survey 
response now acknowledges that delegation of the task may 
also include selection of other business participants beyond 
respondents and that priority of competing tasks is just one 
of the factors influencing the task’s scheduling. 

All previous models have paid considerable attention to 
record formation. The MIBSR model suggests a different 
systematization and extension of factors of record 
formation, initially grouped into management, regulation, 
and standards by Willimack and Nichols (2001). Because it 
is generally unlikely that the requirements of statistical 
reporting are an actual factor of record formation, the 
MIBSR model may assist the survey organization in its 
endeavors to exert influence on record formation and 
eventually obtain requested data. Taking into account 
technological and human capacity of the BIS, the MIBSR 
model defines several levels of answer availability based on 
the extent to which the answer conforms to required survey 
definitions and proposes the likely response outcome. In 
authorization of the business response, the MIBSR model 
reiterates the possibility of internal verification that Sudman 
et al. (2000) and Willimack and Nichols (2001) propose for 
the release step. Authorization is more likely sought out 
when the survey response involves legally separate units 
and more formalized and centralized organizations. 


4.3 Insights at the individual level 


At the individual level, which deals with comprehension, 
retrieval, judgment, and response (Tourangeau 1984), the 
MIBSR model further elaborates on the knowledge relevant 
to cognitive processes. Willimack and Nichols (2001) 
emphasized personal knowledge for answers directly from 
memory and knowledge of the records. The MIBSR model 
suggests that a thorough understanding of the data in 
business records and their appropriate use in the survey 
response require knowledge of the whole chain of data 
generation, from knowledge of business reality to knowledge 
of record formation and knowledge of business records. 

As far as comprehension processes are concerned, 
Edwards and Cantor (1991) have acknowledged the problematic 
use of jargon, and Sudman et al. (2000) have 
pointed to the problematic deviation of required economic 
concepts from accounting standards. The MIBSR model 
goes even further to explain that the errors may result from a 
broader issue of incomprehension of economic and account- 
ing concepts or their confounding with other concepts. 

The MIBSR model identifies several principles that help 
understand the underlying judgment processes in business 
surveys, which are consistent with examples manifesting the 
principles of continuity and consistency by Sudman et al. 
(2000) and Willimack, Nichols and Sudman (2002), 
respectively. These principles may also reflect satisficing 
(Simon 1957) or inertia. The use of inappropriate principles, 
especially the principle of continuity, is particularly 
strengthened by the lack of survey feedback. 

In the cognitive processes of responding, the MIBSR 
model exposes the problem of matching in business surveys, 
thus adding to the rounding error that Sudman et al. (2000) 
discuss. It also integrates different aspects of business 
sensitivity that Edwards and Cantor (1991) have discussed 
as part of the communication step, and Sudman et al. (2000) 
have discussed as part of the release step. The model treats 
them at the individual level where the editing occurs if the 
data are indeed sensitive. 


4.4 Insights at the survey level 


Previous models have concentrated on a single 
occurrence of the response process in a particular business 
survey, while the MIBSR model extends to several 
occurrences and several surveys. Among the many 
dimensions at the survey level, the study systematically 
analyzed the impact of recurrence and contact with the 
survey staff on the response process, which represents a 
further elaboration of specific instances already mentioned 
in previous models in the context of retrieval, such as 
rehearsal of the look-up (Edwards and Cantor 1991) or 
documentation of previous completions supporting retrieval 
(Sudman et al. 2000). In addition, the MIBSR model allows 
for the presence of a contagious effect transmitting the 
experience in one business survey to other business surveys. 


5. Conclusion 


Survey organizations usually have to set aside a con- 
siderable amount of resources for processing survey data 
because the processes of responding to survey questions in 
the businesses are not performed satisfactorily. The MIBSR 
model provides further evidence on how the processes are 
carried out and what influences them. It offers insights into 
the business perspective, which are valuable for efficiently 
seeking solutions to improve the processes and, consequently, 
reduce or eliminate measurement errors. The model 
may also serve as a framework for the documentation and 
systematization of existing and future knowledge on the 
causes of measurement errors in business surveys. It may be 
used as a preceding step of empirical studies on measure- 
ment errors and for a consistent explanation of empirical 
findings. Future research should continue with the appli- 
cation of the qualitative research methods to the study of 
particular dimensions of the response process, other 
business participants besides respondents and other kinds of 
business surveys. It should also embark on quantitative 
modeling of the response process and verifying the 
effectiveness of suggested improvements with experiments. 
Last, it should look into the interactions with other kinds of 
nonsampling errors. 


Examining survey participation and response quality: 
The significance of topic salience and incentives 


Lazarus Adua and Jeff S. Sharp ¹ 


Abstract 


Nonresponse bias has been a long-standing issue in survey research (Brehm 1993; Dillman, Eltinge, Groves and Little 
2002), with numerous studies seeking to identify factors that affect both item and unit response. To contribute to the broader 
goal of minimizing survey nonresponse, this study considers several factors that can impact survey nonresponse, using a 
2007 Animal Welfare Survey conducted in Ohio, USA. In particular, the paper examines the extent to which topic salience 
and incentives affect survey participation and item nonresponse, drawing on the leverage-saliency theory (Groves, Singer 
and Corning 2000). We find that participation in a survey is affected by its subject context (as this exerts either positive or 
negative leverage on sampled units) and prepaid incentives, which is consistent with the leverage-saliency theory. Our 
expectations are also confirmed by the finding that item nonresponse, our proxy for response quality, does vary by 
proximity to agriculture and the environment (residential location, knowledge about how food is grown, and views about the 
importance of animal welfare). However, the data suggest that item nonresponse does not vary according to whether or not 
a respondent received incentives. 


Key Words: Survey nonresponse; Survey participation; Leverage-salience; Prepaid incentives; Item nonresponse; 
Missing data. 


1. Introduction 


Nonresponse bias has been a long-standing issue in 
survey research, as it affects all survey research regardless 
of mode (Nathan 2001). As a result, numerous studies have 
sought to identify factors that affect both item and unit 
response/nonresponse in various survey modes (Groves 
2006; Trussell and Lavrakas 2004; Davern, Rockwood, 
Sherrod and Campbell 2003; Teitler, Reichman and 
Sprachman 2003; Singer, Van Hoewyk and Maher 2000; 
Singer, Van Hoewyk and Maher 1998; James and Bolstein 
1992). While these studies have generated insightful and 
useful information about the factors that affect survey 
participation, questions about survey response still remain 
pertinent to the field of survey research in general and to our 
substantive work in particular. We are interested in 
expanding on the thoughts of Groves et al. (2000) by 
investigating whether specific characteristics of sampled 
units or demographic subpopulations in relation to a 
survey’s topical context affect the response patterns. In our 
ongoing research assessing the general public’s attitudes and 
behaviours related to the agricultural and environmental 
domain, we have become increasingly concerned about the 
level of survey participation and item nonresponse in 
distinct subpopulations. In our case, one concern is that unit 
and item nonresponse may vary among individuals or 
households that are more or less physically or socially 
proximate to the agricultural landscape, which is the focal 
area of our public opinion surveys. 


To contribute to the broader goal of minimizing item and 
unit nonresponse and address some of our concerns, we 
reconsider several factors that can impact survey partici- 
pation and item nonresponse. Specifically, we examine the 
effects of a survey’s subject context (that is, its main focus) 
on survey participation and item nonresponse. We anticipate 
that participation in a survey will be systematically affected 
by how salient the survey’s topic is to each sampled unit. 
This expectation draws on the leverage-saliency theory 
(Groves et al. 2000), which anticipates that a variety of 
factors related to a survey’s main features or features made 
prominent during survey administration might impact 
participation. Our research will also reconsider the effects of 
prepaid incentives on survey response. Given that offering 
incentives to sampled units has remained an enduring and 
widespread practice in the survey industry, we think it 
behoves survey researchers to periodically reassess the 
relationship between incentives and survey participation, 
using varying contexts. Such a continuous assessment of the 
utility of incorporating incentives into surveys is important 
because we cannot assume that incentives will always work 
as intended. 

In the next section, we briefly describe the problem of 
survey nonresponse and then review research on how 
increasing the salience of some survey features and offering 
prepaid incentives affect participation and item non- 
response. The final two sections will cover the research 
design and results of the study. 


1. Lazarus Adua, The Ohio State University, 330 Agricultural Admin Building, 2120 Fyffe Road, Columbus, OH 43210, U.S.A. E-mail:

2. Survey nonresponse and
potential consequences


Survey nonresponse describes the situation in which a 
sampled unit fails either to participate in the survey 
altogether (unit nonresponse) or to respond to one or more 
survey items (item nonresponse). Survey nonresponse has 
been a long-standing issue in survey research. Singer (2006) 
observes that “analysis of JSTOR statistical journals dates 
the first nonresponse article from 1945 and the Public 
Opinion Quarterly index’s earliest reference is from 1948” 
(page 637). However, well-established and nascent survey 
projects alike are experiencing steadily declining response 
rates despite this awareness. For example, the University of 
Michigan’s Survey of Consumer Attitudes (SCA) has 
witnessed a drop in response rate from about 72 percent in 
1979 to about 60 percent in 1996 and a low of 48 percent in 
2003 (Curtin, Presser and Singer 2005). 

Survey nonresponse at both the unit and item levels 
obviously represents a major challenge to survey research, 
given its potential for generating nonsampling errors in 
parameter estimates (Brehm 1993; Dillman et al. 2002;
Groves and Couper 1998). For example, nonresponse may
lead to biased point estimators, variance inflation for point
estimators, and biases in estimators of precision (Dillman
et al. 2002; Groves and Couper 1998). Although unit and
item nonresponse mean different things conceptually in the 
survey literature, their effects on a statistical estimate are 
generally the same (Groves, Fowler, Jr., Couper, Lepkowski,
Singer and Tourangeau 2004). 

While a number of recent studies suggest that low (unit) 
response rates may not have serious adverse effects on data 
quality (Curtin, Presser and Singer 2000; Keeter, Miller, 
Kohut, Groves and Presser 2000; Visser, Krosnick, Marquette 
and Curtin 1996), the fact still remains that unit nonresponse 
can have negative consequences for statistical estimates 
under certain circumstances. As a result, finding creative 
ways to increase response rates so that all types of sampled 
units are represented adequately in the sample remains a key 
goal in survey research. For item nonresponse, it may be 
true that advances in post-survey techniques for handling 
missing data, such as hot-deck and cold-deck imputations, 
mean imputation, multiple imputation, and multiple 
imputation and deletion, have made it possible to reduce the 
challenges this poses. However, the ideal situation and, in 
fact, a primary goal of survey design and implementation is 
to minimize item nonresponse to the greatest extent 
possible. This is because the norm in some fields, especially 
in microeconomics, is to use only the original data 
(Cameron and Trivedi 2009).

3. Making salient key features of a survey
and survey participation


The extent to which a sampled unit views some features 
of a survey as more or less important affects the respondent’s 
likelihood of participating in the survey (Groves et al.
2000). Groves et al. (2000) comment on the interviewing
tactics of experienced interviewers, arguing that what 
interviewers actually do when they tailor their queries or 
remarks to the concerns of respondents is “to heighten the 
salience of some features of the request, those they judge 
will be favorably received by the household” (page 299). 
Building on Groves and Couper (1998), Groves et al. (2000)
propose what they call the leverage-saliency theory to
explain how sampled units make the decision to participate 
or decline to participate in a survey. This theory essentially 
states that there are some attributes (leverage) of a survey 
that may be viewed negatively or positively by the 
respondent, and that how these attributes are made salient 
during the survey request process affects the likelihood of 
participation. If attributes viewed positively by a sampled 
unit (positive leverage) are made salient during the survey 
request, there is a higher chance that the respondent agrees 
to participate in the survey, all other things being equal. On 
the other hand, the likelihood of a sampled unit participating 
in a survey will be hurt if attributes that are viewed 
negatively by the respondent are made salient during the 
survey request. 

Groves et al. (2000) empirically support this theoretical
position. They present civic engagement (measured by 
community involvement) and incentives as leverages on 
survey participation, successfully showing that both attributes 
positively affect the likelihood of participation, with the 
effect of incentives diminishing among sampled units with 
higher civic engagement. In using civic engagement as a 
measure of a survey’s leverage on sampled units, Groves 
et al. (2000) observe that leverage is not measured directly. 
Instead, it may be gleaned from some characteristic(s) of 
respondents in relation to the survey or its features, which 
may exert a positive or negative influence on the likelihood 
of participation. There is also evidence that when survey 
requests are tailored to the concerns of sampled units or to 
what they consider to be important, the likelihood of their 
participation is enhanced (Dillman 2000; Groves and
Couper 1998).

Based on the leverage-saliency theoretical proposition, 
we expect higher rates of participation from respondents 
whose characteristics make them more likely to view 
important attributes (leverage) of a survey positively. 
Correspondingly, we also expect those whose characteristics 
make them less likely to view such attributes positively to 
participate in the survey at lower rates. In our particular area
of research, we anticipate that sampled units’ proximity to
the agricultural and rural landscape (the contextual focus of 
our on-going survey) will affect participation in the survey 
and item nonresponse. This logic also applies to our 
expectations about respondents who claim greater knowledge 
of how food is produced and who also view animal welfare 
as important (a central sub-theme of this particular work). 
We thus draw from the leverage-saliency theoretical
proposition to propose the following hypotheses. 


1. Our survey’s focus on agriculture and the environ- 
ment, which was made salient in its design, is 
expected to exert a positive leverage on respondents 
with greater social and physical proximity to 
agriculture and the rural environment (that is, those 
residing in more rural places). We thus hypothesize 
that participation rates will vary according to 
residential location. 

2. We expect respondents with a closer proximity to 
agriculture and the rural landscape to be more 
diligent in completing the survey than those not in 
close proximity, as the former are more likely to be 
motivated by the survey’s subject matter (that is, its 
positive leverage). We thus hypothesize that item 
nonresponse will vary by proximity to agriculture 
and the rural landscape. 

3. Sampled units who have greater knowledge of how 
their food is grown as well as those who view 
animal welfare as important will have fewer item 
nonresponses. Presumably, such respondents will 
have a greater interest in the survey’s focus on 
agriculture and the environment, and therefore
exhibit more diligence in completing the survey. 


4. Incentives and survey participation 


The use of various forms of incentives, particularly 
prepaid (monetary) incentives, has become a common 
practice in survey research. While the practical rationale for 
offering incentives to sampled units is to encourage 
participation, the theoretical root of this practice is in part 
traceable to the social exchange theory (Dillman 1978). The 
social exchange theory assumes that people’s actions are 
primarily motivated by the returns they expect or obtain 
from engaging in an activity (Weisberg 2005). Gouldner 
(1960) elaborates on the norm of reciprocity, which is 
related to the social exchange theory, observing that “insofar 
as men live under such a rule of reciprocity, when one party 
benefits another, an obligation is generated. The recipient is 
now indebted to the donor, and he remains so until he 
repays” (page 174). In Gouldner’s view, the norm of 
reciprocity makes two demands on people: (1) people
should help those who have helped them, and (2) people
should not injure those who have helped them (Gouldner 
1960, page 171). 

Dillman (1978) uses the social exchange theory and 
particularly the social norm of reciprocity to argue that 
relatively small gestures (such as personalized letters, 
incentives, and reminder letters) can evoke reciprocation 
from sampled households in terms of inclination to 
participate in a survey. Also, Weisberg (2005) notes that 
social exchange is a theory that possibly explains the 
relationship between incentives and survey participation, 
observing that “[f]rom this perspective, giving the re- 
spondent a monetary incentive to participate in the survey 
can be seen as a kindness that evokes a norm of reciprocity” 
(page 165). 

To devise ‘ways and means’ to bolster survey response 
rates as well as to test the social exchange theory in relation 
to incentive use in survey research, a number of experi- 
mental studies have examined the relationship between 
providing incentives to respondents and survey partici- 
pation. While some of these studies have focused primarily 
on the effects of incentives on response rate and item 
nonresponse (Groves, Couper, Presser, Singer, Tourangeau,
Acosta and Nelson 2006; Trussell and Lavrakas 2004; 
James and Bolstein 1992; Church 1993; Singer 2000; 
Yammarino, Skinner and Childers 1991; Fox, Crask and 
Kim 1988), others have examined the effects of incentives 
on respondent expectations and views about surveys (James 
and Bolstein 1990; and Singer et al. 1998). Consistent with 
the main proposition of the exchange theory and the norm of 
reciprocity, many of these studies report a positive
relationship between incentives and response rates (Singer 
et al. 2000; Groves, Couper, Presser, Singer, Tourangeau, 
Acosta and Nelson 2006; Church 1993; Trussell and 
Lavrakas 2004; Goyder 1982; and Yu and Cooper 1983). 

While many studies confirm the importance of incentives 
in encouraging survey participation, the empirically 
informed verdict on the relationship between incentives and 
survey participation is by no means unanimous. In a meta- 
analysis of experimental and quasi-experimental studies 
involving incentive conditions, Church (1993) reports that 
1% of the studies utilized found no evidence of incentives 
affecting participation. Church also states that 10% of the 74 
studies analyzed actually reported a negative relationship 
between the incentive conditions and survey participation. 
In fact, this reality partly prompted Groves et al. (2000) to
propose the leverage-saliency theory to help explain why 
“incentives sometimes work” but “sometimes don’t” (page 
299). Given that findings related to the effects of incentives 
on survey participation are moderately mixed, as well as the 
fact that the subject matter of the survey we are studying 
differs from many previous studies, we find it necessary to
assess incentive effects on survey participation in conjunction
with our examination of the relationship between agricultural 
proximity (our survey’s contextual focus) and response. 
Also, we believe it is important to periodically assess the 
utility of using incentives in survey research, despite the fact 
that this subject has received a lot of attention in the past. 

Another important incentive-related issue is the potential
for higher item nonresponse when reluctant respondents are
induced to participate in a survey (see Hansen 1980).
The potential harm exists in that using persuasions such as 
incentives might elicit information from respondents who 
are careless or indifferent when answering questions, 
ultimately damaging the quality of the information obtained 
in this way (Singer et al. 2000). Owing to this concern, a
number of studies have examined the relationship between 
incentives and item nonresponse, many of which suggest 
that incentives do not seriously harm response quality; that 
is, incentives do not generate higher item nonresponse 
(Singer et al. 2000; Singer et al. 1998; Shettle and Mooney 
1999 and Davern et al. 2003). In fact, Singer et al. (2000)
actually report that prepaid incentives help to reduce item 
nonresponse, an often-used measure of response or data 
quality. However, they also report that respondents who 
received incentives were more likely to give optimistic 
answers in some cases and be more pessimistic in others 
(involving different variables). In our case, a critical concern 
is that urban respondents induced to participate may provide 
lower quality data (as measured by nonresponse) than 
respondents more proximate to the agricultural and rural 
landscape. 

In summarizing the review, we find that the research 
generally suggests that incentives help improve response 
rates in surveys, with little or no effect on item nonresponse. 
Although this is generally the case, some findings on the 
relationship do deviate from this expectation (Church 1993). 
Also, while many studies find that providing prepaid 
incentives does not affect item nonresponse, the work of 
Singer et al. (2000) suggests that providing incentives can
compromise data quality via the mechanism of optimism or 
pessimism bias. Given these caveats, as well as the fact that 
most prior work on the relationship between incentives and 
survey participation was based on bivariate analysis 
(incentive and survey participation), we find it necessary to 
reconsider the impact of incentives on survey nonresponse 
while taking into account the effects of residential location 
in space and socioeconomic status. Thus, drawing from this 
literature on how incentives are related to survey 
participation and item nonresponse, we make the following 
hypotheses. 


1. Respondents who received incentives will participate
in the survey at higher rates than non-recipients, net of
the effects of proximity to the agricultural and rural
landscape and socioeconomic status.

2. Incentives will be negatively related to item non-
response. That is, surveys completed by respondents
who received incentives will have fewer missing
data points than those completed by respondents
who did not receive incentives, controlling for the
effects of respondents’ proximity to the survey’s
subject and other covariates.


5. Study design 


This paper is based on a survey of public views regarding 
food, agricultural and environmental issues, with a special 
focus on farm animal welfare. The target population of the 
survey was Ohio households. An initial sample of 3,000 
respondents (along with their residential addresses) was 
drawn for the study via stratified random sampling: one-half 
(1,500) from Ohio’s 22 core metropolitan counties and the 
second half (1,500) from the state’s 66 metropolitan fringe 
or non metropolitan counties. The number of households in 
the core metropolitan counties differed from those in the 
metropolitan fringe or non metropolitan counties, making 
the sample a disproportionate random sample. To account 
for the unequal probability of selection across the two strata, 
we conducted weighted analysis for this paper. 
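
The weighting step described above can be illustrated with a minimal sketch. It assumes that the base weight for each stratum is the inverse of its sampling fraction; the household counts are hypothetical placeholders rather than figures from the study, and the function names are illustrative only.

```python
def design_weights(households, sampled):
    """Return the base weight N_h / n_h for each stratum."""
    return {stratum: households[stratum] / sampled[stratum] for stratum in sampled}


def weighted_proportion(values, weights):
    """Weighted share of 1s in a list of 0/1 values."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical household counts per stratum (NOT taken from the paper).
households = {"core_metro": 3_000_000, "fringe_nonmetro": 1_500_000}
# Sample sizes per stratum as described in the study design.
sampled = {"core_metro": 1500, "fringe_nonmetro": 1500}

weights = design_weights(households, sampled)
```

With equal sample sizes in strata of unequal size, units from the larger stratum carry proportionally larger weights, which is what keeps the disproportionate sample from distorting statewide estimates.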

The sample we used was obtained from Experian, a U.S.- 
based credit reporting bureau and private list vendor. The
sample was drawn from a sample frame (database) 
consisting of Ohio households along with their residential 
addresses. While we do not pretend that this sample frame 
covers all Ohio households, we believe that it is one of the 
most reliable and up-to-date lists and databases in the U.S. 
from which one can draw a sample. According to Experian, 
the database is updated monthly. 

The survey followed a modified tailored design method 
(Dillman 2000) with up to four mailings sent to potential 
respondents during the spring of 2007. The first mailing was 
a pre-notification letter sent to each sampled unit, followed 
shortly by the survey packages. The third mailing was a 
reminder postcard sent to respondents thanking them for 
participating in the study or encouraging them to complete 
and return the survey if they had not yet done so. In the 
fourth mailing, replacement survey packages were mailed to 
respondents who had not returned completed questionnaires 
about 10 days after the postcard was mailed out. Of these 
four contacts with the respondents, three had information 
that focused specifically on the subject or topic of the 
survey. The pre-notification letter and the cover letters for 
the initial and replacement survey packages specifically 
conveyed to respondents the subject matter of the survey. 
Also, the graphics printed on the cover page of the survey
(images of farm animals) were selected to further convey
this subject matter. 

The addresses of sampled units were geo-coded and 
placed in a locational field (see details later in this section) 
to locate them geographically across the rural-urban 
continuum. This allowed us to conduct analyses of how 
sampled units’ proximity to the agricultural landscape is 
related to their likelihood of participating in the survey. We 
recognize that some urban residents may have frequent 
social and physical interactions with agriculture and the 
rural landscape; however, this kind of interaction, along 
with its effects on support for agriculture and the 
environment, is highest among those residing in more rural 
and open country places (Freudenburg 1991; Sharp and 
Adua 2009). A randomized experiment involving incentives 
was also built into the survey. The first survey packages 
mailed to a randomly-selected half of the sampled units 
included $2.00 (two one dollar bills) incentives, while the 
other half of the sample received the same package but 
without any incentives. In doing this experiment, our 
pragmatic objective was to assess the effectiveness of our 
practice of enclosing modest cash incentives in survey 
packages to improve participation in our ongoing surveys of 
the Ohio public. Similar to Groves et al.’s (2000) expecta- 
tions about the effect of community involvement on levels 
of participation, we also anticipated that households located 
in close proximity to agriculture and the rural landscape 
would participate at high levels in our study independent of 
the incentive, perhaps to the extent that a token financial 
incentive might be deemed unnecessary in future iterations 
of the survey. 
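
A random split of the sample into incentive and no-incentive halves might be sketched as below. The text does not say whether the randomization was carried out within strata, so this sketch assumes a simple unstratified split; the seed and the function name are illustrative assumptions.

```python
import random


def assign_incentive(unit_ids, seed=2007):
    """Randomly pick half of the units to receive the prepaid incentive."""
    units = list(unit_ids)
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = units[:]
    rng.shuffle(shuffled)
    treated = set(shuffled[: len(shuffled) // 2])
    # 1 = package includes the incentive, 0 = same package without it.
    return {uid: int(uid in treated) for uid in units}


assignment = assign_incentive(range(3000))
```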


5.1 Analytic strategy 


Two sets of statistical analyses are conducted in this 
paper. The first set of analyses focuses on survey partici- 
pation (response rate). First, we examine the proportion of 
successfully contacted sampled units who complete and 
return surveys by residential location along the rural-urban 
continuum, a proxy for geographic proximity to agriculture 
and rural areas of the state (an assumption we justify in a 
later section), and by incentive status. Following the 
American Association for Public Opinion Research’s
(AAPOR) 2008 guidelines for codes disposition, we defined
successfully contacted sampled units as (i) those from 
whom we received completed surveys by the end of the data 
collection phase of the project, and (ii) those from whom we 
received neither a completed survey nor the survey package 
back from the United States Postal Service (USPS) as 
undeliverable. In our contract with the USPS, we requested
that all mail that could not be delivered due to a wrong
address or absence of forwarding information be returned to
us. The sampled units to which this undeliverable mail
was addressed were classified as units we were
unsuccessful in contacting. We also employ logistic regression
to further analyze the likelihood of survey participation 
(coded 1 = responded; 0 = did not respond), using
residential location along the rural-urban continuum and 
incentive status as the primary predictors, while simulta- 
neously controlling for the effects of socioeconomic status 
at respondents’ block group level as per the 2000 U.S. 
population census. We control for the effect of socio- 
economic status because previous studies suggest it has 
some relationship with survey participation (Davern et al.
2003; Singer et al. 2000). 
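
To make the logistic regression step concrete, the following minimal sketch fits a model with a single binary predictor (for example, incentive status) by Newton-Raphson. It is unweighted and omits the location and socioeconomic covariates used in the study; the counts are hypothetical and are not study results.

```python
import math


def fit_logit_binary(n0, y0, n1, y1, iterations=25):
    """Fit logit P(respond) = b0 + b1 * x by Newton-Raphson, x in {0, 1}.

    n0, y0: units and respondents with x = 0; n1, y1: same for x = 1.
    """
    b0 = b1 = 0.0
    groups = [(0, n0, y0), (1, n1, y1)]
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, y in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = y - n * p          # score contribution
            w = n * p * (1.0 - p)      # information contribution
            g0 += resid
            g1 += x * resid
            h00 += w
            h01 += x * w
            h11 += x * x * w
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical counts (NOT study results): 40 of 100 units respond
# when x = 0 and 60 of 100 respond when x = 1.
b0, b1 = fit_logit_binary(100, 40, 100, 60)
odds_ratio = math.exp(b1)
```

With one binary predictor the fitted slope equals the log odds ratio of the corresponding two-by-two table, which offers a simple check on the fit.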

The second set of analyses focuses on item nonresponse. 
In this analysis, we conduct partial proportional ordered 
logistic regression analysis (generalized ordered logit) on the 
first two item nonresponse variables (0 = no missing items;
1 = some missing items; and 2 = numerous missing items),
once again employing residential location along the rural- 
urban continuum and incentive status as the primary 
independent variables while controlling for the effects of 
several other variables. Generalized ordered logit (partial 
proportional odds) is employed rather than ordered logit 
because some predictors in these models violated the 
proportional odds assumption of ordered logistic regression. 
By using partial proportional odds modeling, we are able to 
constrain the relationship between those independent and 
dependent variables that met the proportional odds assump-
tion of ordered logistic regression while allowing the rela-
tionships that failed this assumption to vary. To analyze the
third item nonresponse variable, we employ logistic
regression. This variable was recoded into a dichotomy (see
the section on operationalization of variables for more 
details). 
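
The distinction between constrained and unconstrained coefficients can be sketched as follows. The sketch assumes the common parameterisation in which each cut point j has its own intercept and P(Y > j) is a logistic function of the linear predictor; sign conventions differ across software, and all coefficient values are hypothetical.

```python
import math


def category_probabilities(alphas, beta_common, x_common, gammas, x_varying):
    """Category probabilities under a partial proportional odds model.

    Uses P(Y > j) = logistic(alpha_j + beta'x + gamma_j'z), where beta is
    shared across cut points and gamma_j is allowed to differ by cut point.
    """
    exceed = []
    for alpha, gamma in zip(alphas, gammas):
        eta = alpha
        eta += sum(b * x for b, x in zip(beta_common, x_common))
        eta += sum(g * z for g, z in zip(gamma, x_varying))
        exceed.append(1.0 / (1.0 + math.exp(-eta)))
    # Convert P(Y > j) into P(Y = k) for k = 0 .. len(alphas).
    # Note: unconstrained gammas can make these curves cross, in which
    # case a difference below would be negative.
    bounds = [1.0] + exceed + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


# Hypothetical coefficients (NOT estimates from the study) for a
# three-category outcome: 0 = none, 1 = some, 2 = numerous missing items.
probs = category_probabilities(
    alphas=[0.5, -2.0],
    beta_common=[-0.3], x_common=[1.0],       # predictor meeting the assumption
    gammas=[[-0.2], [0.4]], x_varying=[1.0],  # predictor violating it
)
```

Setting every gamma vector to the same values recovers the ordinary ordered logit, which is why the partial model can be viewed as a relaxation of it.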


5.2 Operationalizing dependent variables 


Survey Participation: Survey participation (response 
rate) is measured by computing the number of completed 
surveys received from respondents (eligible participating 
cases) as a proportion of the sampled units contacted 
successfully (all eligible cases). This measure of survey 
participation is in conformity with AAPOR guidelines for 
measuring response rates. Undeliverable surveys returned 
by the USPS without additional information, such as 
forwarding address or address correction, were treated as 
ineligibles. Cases for which we neither received completed 
surveys nor any other information about the cases from the 
USPS were treated as eligible based on the recommendation 
of the AAPOR’s 2008 revised standard definitions of codes 
disposition and outcome rates. To conduct the logistic 
regression analysis of response likelihood, we coded all 
successfully contacted sampled units (eligible cases) as 1
(returned a completed questionnaire) or 0 (did not return a
completed questionnaire). We provide no descriptive
statistics for this variable here as the analysis section, 
especially the marginals of the contingency tables, provides 
a good sense of the distribution of this variable. 
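
The response rate definition above reduces to a short calculation, sketched here. The sample size comes from the study design; the undeliverable and completed counts are hypothetical placeholders, not results from the study.

```python
def response_rate(sample_size, undeliverable, completed):
    """Completed surveys as a share of successfully contacted (eligible) units."""
    eligible = sample_size - undeliverable   # undeliverables are ineligible
    if eligible <= 0:
        raise ValueError("no eligible units")
    return completed / eligible


# Hypothetical counts for illustration only.
rate = response_rate(sample_size=3000, undeliverable=200, completed=980)
```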

Response quality: Response quality is measured by the 
occurrence of item nonresponse (see Davern et al. 2003; and 
Kaldenberg, Koenig and Becker 1994). To compute item 
nonresponse, missing data points for all respondents 
participating in the survey were summed across three subsets 
of items in the survey instrument to generate three item 
nonresponse variables: item nonresponse I, item nonresponse 
II and item nonresponse III. The item nonresponse I variable 
was created from items that, in our estimation, exerted 
comparatively the lowest cognitive demand on respondents, 
including such items as demographics and opinion questions 
that did not require very much introspection. The item 
nonresponse II variable was created from items that exerted 
comparatively higher cognitive demands on respondents 
than those used to create item nonresponse I, such as 
questions that required significant recall efforts and opinion 
questions that required a high level of introspection. The 
third variable is constructed from items that exerted 
comparatively the highest cognitive demand on respondents, 
such as knowledge questions and questions that required 
some understanding of concepts associated with animal 
husbandry. 

In summing across these variables, we did not treat 
‘Don’t Know’ answers as item nonresponse, given that the 
survey had a couple of knowledge questions for which a 
‘Don’t Know’ response could be a legitimate answer. The 
item nonresponse variable also does not include “refused to 
answer” responses, as this option was not provided in 
questions used in the creation of the variables. We also 
excluded from these variables questions that respondents 
were directed to skip if they found them to be inapplicable. 
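
The counting rules in this paragraph might be sketched as follows. The item names, the respondent record, and the use of None to mark a blank answer are all assumptions made for illustration.

```python
DONT_KNOW = "Don't Know"


def count_item_nonresponse(answers, items, skipped=()):
    """Count missing answers among `items` for one respondent.

    answers: dict mapping item name -> answer, with None for a blank.
    skipped: items the respondent was directed to skip (not counted).
    """
    missing = 0
    for item in items:
        if item in skipped:
            continue                      # legitimately inapplicable
        if answers.get(item) is None:     # blank or absent -> nonresponse
            missing += 1
        # Any recorded answer, including DONT_KNOW, counts as a response.
    return missing


# Hypothetical respondent and item names.
answers = {"q1": 3, "q2": None, "q3": DONT_KNOW, "q4": None}
n_missing = count_item_nonresponse(
    answers, ["q1", "q2", "q3", "q4"], skipped={"q4"}
)
```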

Owing to the fact that the distribution of these variables 
was heavily skewed (see Table 1), the item nonresponse I 
and nonresponse II variables were regrouped into three 
ordinal categories (0 = no missing items; 1 = some missing 
items; and 2 = numerous missing items) and analyzed using 
generalized ordered logit. The first category (0) included 
cases without any item nonresponse, while the second 
category (1) included cases with between 1 and 9 incidences
of nonresponse. The third category (2) included cases with 
10 or more item nonresponses. For our analysis, we also 
regrouped the item nonresponse III variable into a 
dichotomy: 0 (no missing cases) and 1 (1 or more missing
cases). This variable was regrouped differently from the first 
two because very few cases (only 19) satisfied the criteria 
for classification as “numerous missing cases” (Table 1). To 
verify whether our regrouping of these variables masked 
variances in item nonresponse within the groups (cases
grouped together) that may be explained by our two
independent variables (residential location, i.e. an indicator 
of interest in the survey topic, and incentives), we conducted 
a one-way analysis of variance for these grouped cases. 
Within these groups, none of the three item nonresponse 
variables varied significantly by residential location or 
incentives. Descriptive statistics for all three item 
nonresponse variables are reported in Table 1. 
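
The regrouping rules stated above can be written as two small functions. The cut points follow the text; the function names are illustrative.

```python
def recode_three_level(count):
    """0 = no missing items, 1 = 1 to 9 missing, 2 = 10 or more missing."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return 0
    return 1 if count <= 9 else 2


def recode_dichotomy(count):
    """0 = no missing items, 1 = one or more missing (item nonresponse III)."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return int(count > 0)


recoded = [recode_three_level(c) for c in (0, 1, 9, 10, 44)]
```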


Table 1
Descriptive statistics for item nonresponse variables

                                 Item            Item            Item
                                 nonresponse I   nonresponse II  nonresponse III
Statistics before recoding
  N                              971             971             971
  Mean                           [illegible]     2.34            1.6
  Standard deviation             5.06            [illegible]     [illegible]
  Minimum value                  0               0               0
  Maximum value                  44              48              29
Statistics after recoding into groups
  Zero missing                   30.07%          59.53%          54.69%
  Some missing                   62.31%          32.65%          43.36%
  Numerous missing               7.62%           7.83%           1.96%


5.3 Operationalizing independent and control 
variables 


Residential Location: The survey’s focus on agricultural 
and environmental issues was made salient during the 
survey request (via the pre-notification letters, the cover 
letters and the design of the survey instrument), which can 
affect participation negatively or positively depending on 
each respondent’s residential location along the rural-urban 
continuum. Residential location is an indicator of 
respondents’ differentiated social and physical proximity to 
agriculture and the rural landscape. This is because prox- 
imity can increase the social and/or physical interactions 
with the subject. The association between proximity and 
environmental concern has been proposed and tested
numerous times by social scientists (Dunlap and Heffernan 
1975; Freudenburg 1991; Sharp and Adua 2009). We go a 
step beyond hypothesizing attitudinal differences associated 
with proximity and anticipate different levels of survey 
participation; indeed, we hypothesize that sampled units 
residing closer to agriculture and the rural landscape will 
participate in the survey at higher rates than those in core 
urban places. As a result, the subject matter of our survey is 
expected to serve as a positive leverage on sampled units 
residing closer to agriculture and the rural landscape. While 
this may not be a direct measure of leverage, it is consistent 
with Groves et al.’s (2000) suggestion that the leverage a
given survey exerts on a sampled unit can be measured
indirectly by relying on pertinent characteristics of the
sampled unit. In using the spatial residential characteristics
of sampled units, we are relying on the fact that sampled 
units residing in more rural and open country areas have a 
higher likelihood of social and physical interaction with the 
agricultural and rural landscape than those in more 
urbanized places (see Table 2). In both 2006 and 2007, 
higher proportions of residents of exurban townships and 
rural areas (a combination of rural city/village and rural 
townships) visited a working farm than residents of core 
urban places, as shown in Table 2. We acknowledge that 
using information from our own respondents to show the 
association between residential location and visits to farms 
may be problematic. However, this information is 
corroborated by information from a different sample, the 
2006 Ohio Survey. 

To determine the residential location of the sampled 
units, each respondent’s residential address was geocoded 
and assigned to one of four location fields — urban, 
suburban, exurban or rural—using ESRI’s ArcView 
geocoding. Sampled units living in the exurban and rural 
fields were further distinguished as residing in either 
incorporated places (city/village) or township places (open 
country). This process of characterizing sampled units as 
living in urban, suburban, exurban, or rural places has 
previously been employed successfully in the field of 
regional science (Audirac 1999; Sharp and Clark 2008). 

In this study, this variable has been grouped into five 
categories: (1) core urban, (2) suburban places, (3) exurban 
city/village, (4) exurban township and (5) rural places (cities/
villages and townships). The ordering of the categories does 
not suggest a monotonic increasing order in terms of 
proximity to agriculture and the rural landscape between 
categories 1 and 5. Instead, this variable should be seen as a
nominal variable with categories that can be grouped into 
blocks based on proximity to agriculture and the rural 
landscape: block 1 (categories 1 and 2) has the lowest
proximity, block 2 (category 3) has intermediate proximity 
and block 3 (categories 4 and 5) has the highest proximity. 


Table 2
Frequency of visiting or touring a working farm

Between the blocks, the categories are monotonic increasing
in terms of proximity to agriculture and the rural landscape, 
but within the blocks the pattern is less certain. Here, too, 
we provide no descriptive statistics for this variable as the 
analysis section provides an ample sense of how the variable 
is distributed. 

Knowledge of Food Production and Support for Animal 
Welfare: Two other indicators of survey leverage used in the 
analysis are two survey items that measured sampled units’ 
knowledge of how their food is produced and their views 
about the importance of animal welfare. The first asked, 
“How knowledgeable are you about how your food is 
grown? Please indicate on a scale of 1 to 7 your level of
knowledge.” This item had a mean of 4.47 and a standard 
deviation of 1.60. The second item asked, “Thinking about 
farm animals in general, how important is this issue to you? 
Please indicate on a scale of 1 (not important) to 7 (very
important).” This item had a mean score of 4.50 and a 
standard deviation of 1.68. These two indicators are used in 
analyses pertaining only to the item nonresponse variables. 

Incentive Status: Sampled units’ incentive status
(received versus did not receive incentive) is a primary
independent variable in the regression models. Incentive
status is dummy-coded as 0 (did not receive incentive) and 1
(received incentive). Again, we provide no descriptive
statistics for this variable because the analysis provides a 
good sense of the variable’s distribution. 

Control Variables: Control variables operationalized in 
one or more of the analyses conducted in this study include
Age (respondent’s age as of his/her last birthday), Education 
(highest level of education completed), Ethnicity (white = 1; 
all others = 0) and Gender (male = 0 and female = 1), as well 
as the per capita and disposable median household income of 
each sampled unit’s block group as per the 2000 population 
census. We control for the effects of these variables because 
previous studies suggest they can affect item nonresponse 
(Davern et al. 2003; Singer et al. 2000). Descriptive statistics 
for these purely control variables are shown in Table 3. 


                                       2006 Ohio Survey ᵃ                      2007 Animal Welfare Survey ᵇ
                                       Never/    Occasionally/                 Never/    Occasionally/
Residential location                   seldom    frequently     Total ᶜ        seldom    frequently     Total ᶜ
Core urban                             90.4%     9.6%           100% (185)     81.0%     19.0%          100% (121)
Suburban place                         87.5%     12.5%          100% (536)     83.7%     16.3%          100% (285)
Exurban city/village (Incorporated)    78.6%     21.4%          100% (217)     76.4%     23.6%          100% (124)
Exurban township (Unincorporated)      74.9%     25.1%          100% (434)     67.9%     32.1%          100% (264)
Rural place                            73.1%     26.9%          100% (238)     70.6%     29.4%          100% (136)
Total                                  80.6%     19.4%          100% (1,610)   74.2%     25.8%          100% (930)

ᵃ Second-order corrected chi-square (3.61) = 43.3; P = 0.0000 (corrected for survey design effects)
ᵇ Second-order corrected chi-square (3.67) = 16.7; P = 0.001 (corrected for survey design effects)
ᶜ In parentheses are the total number of eligible cases from each residential category.
Table 3
Descriptive statistics for control variables

                                             Mean/percent    Standard deviation
Education:
  High school and lower                      36.8%           -
  Some college                               32.3%           -
  Bachelor’s degree                          13.7%           -
  Grad/professional work & higher            17.2%           -
Gender:
  Male                                       48.2%           -
  Female                                     51.8%           -
Ethnicity:
  White                                      91.7%           -
  Non-white                                  8.3%            -
Age                                          [illegible]     15.8
Block level mean household income, 2000      49,842.3        [illegible]
Block level median household income, 2000    42,616.3        16,728.6


6. Results 


To evaluate survey participation, we use both bivariate 
analysis (contingency tables) and logistic regression 
modeling. For the contingency tables, we use Pearson chi- 
squared statistics corrected for survey design with Rao and 
Scott’s (1984) second-order correction. We do this because 
survey design features such as stratification and clustering 
can affect tests of association (Lohr 1999). To limit the 
length of this paper, we follow a different analytical plan for 
the item nonresponse set of variables. For this set, we 
conduct only multivariate analysis (logistic regression). 
Moving straight to multivariate analysis allows us to 
examine the partial effects of the various predictors used in 
the models while keeping the paper brief. 


6.1 Bivariate results for survey participation 


The bivariate analysis suggests that survey participation 
varies significantly by proximity to the agricultural and rural 
landscape (residential location along the rural-urban 
continuum). As shown in Table 4, respondents residing in 
geographically more rural places (rural and exurban 
township residents) have higher rates of participating in the 
survey than those residing in geographically more urban 
places (core urban and suburban residents). The analysis 
also shows that those in the intermediate exurban 
incorporated places (cities and villages) were slightly more 
likely to participate than core urban residents. A second- 
order corrected chi-square test (Rao and Scott 1984) of the 
relationship between survey participation and residential
location was significant (χ² = 14.2; df = 3.7; p = 0.003).

Our analysis is consistent with previous studies, also 
finding that prepaid incentives significantly increase survey 
participation (Table 5). Despite the fact that the context of
the survey used for our analysis differs markedly from 
previous studies examining the effects of incentives, we find 
that the response rate for successfully contacted incentive 
recipients was 43.7% compared with 26.9% for successfully 
contacted sampled units who did not receive prepaid 
incentives. The second-order corrected chi-square test of 
this bivariate relationship is also statistically significant 
(χ² = 73.8; df = 1; p = 0.000). In fact, our analysis suggests
that eliminating incentives altogether substantially hurts 
participation rates for all categories of respondents regard- 
less of proximity to the agricultural and rural landscape, 
although this effect is highest for residents in core urban 
places (Table 6). This finding provides support for our 
ongoing practice of using prepaid monetary incentives to 
help bolster our response rates with no discrimination 
between whether respondents reside in rural or urban 
locales. It also reaffirms the importance of incentives in 
survey research. 
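
The mechanics of the bivariate test can be illustrated with a minimal sketch. It computes an ordinary (uncorrected) Pearson chi-square from counts back-calculated from the percentages and totals in Table 5. The back-calculated counts and the function name `pearson_chi_square` are assumptions made for illustration; the sketch does not apply the Rao and Scott (1984) second-order correction, so its result is not expected to reproduce the design-corrected value of 73.8.

```python
def pearson_chi_square(table):
    """Uncorrected Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Approximate counts back-calculated from Table 5 (an assumption, not
# reported by the authors): 43.7% of 1,410 incentive cases and 26.9% of
# 1,401 no-incentive cases responded.
responded_incentive = round(0.437 * 1410)
responded_no_incentive = round(0.269 * 1401)
incentive = [responded_incentive, 1410 - responded_incentive]
no_incentive = [responded_no_incentive, 1401 - responded_no_incentive]

chi2 = pearson_chi_square([incentive, no_incentive])
print(f"Uncorrected Pearson chi-square (df = 1): {chi2:.1f}")
```

Because this statistic ignores stratification and clustering, it should be read only as an illustration of the calculation, not as a substitute for the design-corrected test reported above.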


6.2 Logistic regression model for survey 
participation 


Multivariate analysis further suggests that the likelihood 
of survey participation varies significantly by proximity to 
agriculture and the rural landscape, statistically holding 
constant the effects of incentive status (received versus did 
not receive incentive). Residents of suburban places, 
exurban townships, and rural places are significantly more 
likely to participate in the survey than residents of core 
urban places (Table 7). For example, residents of exurban 
townships and rural places have higher odds (0.60 log odds 
and 0.37 log odds, respectively) of participating than those 
of core urban places. 
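
Log odds can be hard to interpret directly. As a minimal sketch (the helper name `odds_ratio` is ours, not the authors'), exponentiating the two coefficients quoted above gives the corresponding odds ratios relative to core urban residents:

```python
import math

def odds_ratio(log_odds: float) -> float:
    """Convert a logistic regression coefficient (log odds) to an odds ratio."""
    return math.exp(log_odds)

# Coefficients quoted in the text; the reference category is core urban residents.
coefficients = {"exurban township": 0.60, "rural place": 0.37}

odds_ratios = {name: round(odds_ratio(b), 2) for name, b in coefficients.items()}
print(odds_ratios)
```

Read this way, the estimated odds of participating are roughly 1.8 times as high for exurban township residents, and about 1.45 times as high for rural residents, as for core urban residents, holding incentive status constant.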


Survey Methodology, June 2010 


Table 4
Participation rate by residential location

Residential location                   Responded    Did not respond    Total ᵃ
Core urban                             29.5%        70.5%              100% (424)
Suburban place                         32.6%        67.4%              100% (917)
Exurban city/village (Incorporated)    33.1%        66.9%              100% (379)
Exurban township (Unincorporated)      40.5%        59.5%              100% (684)
Rural place                            35.8%        64.2%              100% (405)
Total                                  35.4%        65.6%              100% (2,809)

Second-order corrected chi-square (3.7) = 14.2; P = 0.003 (corrected for survey design effects)
ᵃ In parentheses are the total number of eligible cases from each residential category.


Table 5
Survey response by incentive status

Incentive status    Responded    Did not respond    Total ᵃ
Incentive           43.7%        56.3%              100% (1,410)
No incentive        26.9%        73.1%              100% (1,401)
Total               35.4%        64.6%              100% (2,811)

Second-order corrected chi-square (1) = 73.8; P = 0.000 (corrected for survey design effects)
ᵃ In parentheses are the total number of eligible cases by incentive status.


Table 6
Response rate by incentives and residential location along the rural-urban continuum

                                       Incentive     Non-recipients    Response
                                       recipients    of incentive      difference
Core urban                             0.41          0.19              0.22
Suburban place                         0.41          0.24              0.17
Exurban city/village (Incorporated)    0.39          0.27              0.12
Exurban township (Unincorporated)      0.48          0.31              0.17
Rural place                            0.44          0.27              0.17
Total                                  0.43          0.26              0.17


Logistic regression analysis also seems to confirm our 
earlier finding that the likelihood of participating varies 
significantly by whether or not a sampled unit received 
incentives. Respondents who received incentives had higher 
odds (0.73 log odds) of participating in the survey than 
those who did not receive incentives, controlling for 
proximity to agriculture and the rural landscape as well as 
the gender (female=1) of the householder randomly 
assigned as the preferred household member to complete 
and return the survey (Table 7). 

Because socioeconomic status varies significantly by 
residential location across space (Lobao 1990) and affects 
survey response (Davern et al. 2003; Singer et al. 2000), we
endeavored to control for the potential effects of per capita 
income and household income (socioeconomic status) on 
the likelihood of survey participation using hierarchical 
linear modeling (HLM). To do this, respondents were linked 
to their block groups and block group characteristics
(specifically, block group per capita income and block
group household median income) as per the 2000 U.S. 
population census. For the HLM analysis, we initially 
estimated a fully unconditional model (that is, an ANOVA) 
to determine whether the likelihood of survey participation 
varied significantly across the block groups. In hierarchical 
linear modeling, estimating a fully unconditional model 
(model without predictors at all levels of the analysis) is 
typically used to determine whether the dependent variable 
varies by the level two (or higher) unit of analysis, such as a 
neighborhood, block group or school district. This initial 
model (ANOVA) often helps researchers determine whether 
to proceed with multi-level analysis. Our initial HLM 
analysis (ANOVA) did not reveal any significant variation 
in the likelihood of survey participation across the block 
groups (tau = 0.04; p = 0.493). While this finding suggests 
the average probability of survey participation is about the 
same for all block groups despite their different per capita 
and household disposable median incomes, we acknowledge
potential instability in this HLM model given that sample 
cases per block group were generally low. This may have 
led to our finding of no significant variation in the like- 
lihood of participation across the block group (potential 
Type II error). Despite this potential problem with our fully 
unconditional model, we did not proceed with the fully 
conditional multi-level analysis. 
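
One way to gauge how small the reported between-group variance is would be to express it as an intraclass correlation. The sketch below is not part of the authors' analysis. It assumes that the reported tau (0.04) is the random-intercept variance on the logit scale, and it uses the common latent-variable convention that the level-one residual variance of a logistic model is π²/3. The function name `latent_icc` is hypothetical.

```python
import math

def latent_icc(tau: float) -> float:
    """Latent-scale intraclass correlation for a two-level logistic model.

    Assumes tau is the level-two (block group) intercept variance and takes
    the level-one variance to be pi^2 / 3, the variance of the standard
    logistic distribution.
    """
    return tau / (tau + math.pi ** 2 / 3)

icc = latent_icc(0.04)  # tau as reported for the fully unconditional model
print(f"Approximate latent-scale ICC: {icc:.3f}")
```

Under these assumptions, only about 1% of the latent variation in participation would lie between block groups, which is in line with the non-significant variance component reported above.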


6.3 Logistic regression model for item nonresponse 


As noted earlier in this section, our analysis of item 
nonresponse is limited to multivariate modeling, and we do 
this primarily to keep the paper brief while achieving our 
objective of assessing the partial effects of our main 
independent variables. The data suggest that the anticipated 
leverage of the survey’s subject is only modestly related to 
item nonresponse. With respect to item nonresponse I (that
is, the variables created from questions with the least 
cognitive demand on respondents in the survey), the 
analysis suggests that respondents in exurban township 
areas have lower item nonresponse (-0.74 log odds) than 
those residing in core urban areas, although this difference 
disappears at the higher values of this variable (Table 8, 
Columns 2 and 3). However, for item nonresponse II (the 
item nonresponse variables created from questions more 
cognitively demanding than those used in item nonresponse 
I) we find that residents of exurban townships and rural 
places are more likely to have higher item nonresponses 
(0.85 and 0.82 log odds, respectively) than residents of core 
urban areas (Table 8, Column 4). In terms of item
nonresponse III (the item nonresponse variables created
from the most cognitively demanding questions), the
analysis did not reveal any significant difference by
residential location, our proxy for level of interest in the
survey’s topic.

Table 7
Logistic regressionᵃ of likelihood of participation

Supporting the anticipated effect of interest in a survey’s 
topic on item nonresponse, the analysis also suggests that 
respondents’ knowledge of how food is produced is 
significantly related to item nonresponse. In terms of item 
nonresponse II, the data show that respondents who
reported knowing how food is produced have lower log 
odds (-0.13) of item nonresponse than those who reported 
having less knowledge of how food is produced (Table 8, 
Column 4). This relationship is stronger at higher values of 
the variable: knowledge of how food is produced has lower 
log odds (-0.35) of item nonresponse when the category 
value shifts from 0 to 1 (Table 8, Column 5). This result
suggests that the positive leverage of the survey’s topic may 
have resulted in greater care in the completion of the survey 
among respondents with greater knowledge of how food is 
produced. We also find that respondents’ views about the 
importance of animal welfare, a central subtheme of this 
particular survey, are positively related to item nonresponse 
(Table 8, Column 4). As shown in Table 8, a one unit 
increase in viewing animal welfare as important leads to a 
0.09 unit increase in the log odds of item nonresponse 
(specifically item nonresponse II). This finding is 
inconsistent with our expectations. 

In terms of the effects of incentives, we find no 
significant relationship between incentives and any of the 
three measures of item nonresponse (Table 8, Columns 2, 4 
and 6), contrary to our expectation. 


                                          Log odds of participation
                                          b            Std. Error
Incentive status
  Did not receive incentive (Ref)         -            -
  Received incentive                      0.73         0.09
Residential location
  Core urban residents (Ref)              -            -
  Suburban residents                      0.27*        0.13
  Exurban city/village residents          0.25         0.15
  Exurban township residents              0.60         0.13
  Rural residents                         0.37         0.15
First option to respond (Female = 1)      -0.05        0.09
Model statistics
  Intercept                               -1.42***
  Wald χ² (df = 6)                        [illegible]

Significance: *** < 0.001; ** < 0.01; and * < 0.05
[Significance markers for the incentive, exurban and rural coefficients are illegible in the source.]

ᵃ In this model we tested for potential interaction effects between residential location and incentives. We found no evidence of such an interaction effect.
Table 8


Logistic regression models’ for item nonresponse 




Columns: (2) Item nonresponse Iᵇ, no missing; (3) Item nonresponse Iᵇ, some missing;
(4) Item nonresponse IIᵇ, no missing; (5) Item nonresponse IIᵇ, some missing;
(6) Item nonresponse IIIᶜ. All entries are log odds.

                                              (4) Item nonresponse II,
                                              no missing
Incentive status
  Did not receive incentive (Ref)             -
  Received incentive                          0.10 (0.17)
Subject salience: Residential location
  Core urban residents (Ref)                  -
  Suburban residents                          0.54 (0.29)
  Exurban city/village residents              0.30 (0.34)
  Exurban township residents                  0.85** (0.30)
  Residents of rural places                   0.82** (0.31)
Subject salience: Food knowledge and animal welfare
  Knowledge about how food is produced        -0.13* (0.05)
  Importance of animal welfare                0.09* (0.04)
Controls
  Education:
    High school and lower (Ref)               [illegible]
    Some college                              [illegible]
    Bachelor’s degree                         [illegible]
    Grad/professional work & higher           [illegible]
  Age                                         [illegible]
  Gender (Female = 1)                         [illegible]
  White                                       [illegible]
Model statistics
  Intercept                                   [illegible]
  Wald chi-squareᵈ                            [illegible]
  N                                           [illegible]

Entries surviving in the other columns, whose row placement is not recoverable:
column (3): 0.30 (0.40); -4.36. Column (6): [illegible] (0.09).

Significance: *** < 0.001; ** < 0.01; and * < 0.05
Standard errors shown in parentheses.

ᵃ We tested for potential interaction effects between residential location and incentives, between age and incentives and between ethnicity
(white) and incentives in these models following Singer et al. (2000). We found no evidence of such interaction effects.

ᵇ The item nonresponse I and II models are partially constrained proportional odds logit models. This is because some of the predictors of
these models violated the parallel lines assumption. These predictors were thus allowed to vary, while the remaining ones were
constrained. Williams’ (2006) gologit2 Stata program code was used to estimate the model.

ᶜ This model is a logistic regression model with a binary dependent variable (variable recoded into two categories).
ᵈ Degrees of freedom are 14, 14, and 13 for the low cognitive, mid cognitive, and high cognitive models, respectively.
In terms of the control variables, we find that education is
significantly related to item nonresponse, which is
consistent with the earlier findings of Singer et al. (2000). In
our case, respondents with some college work, a bachelor’s 
degree, or some graduate/professional work have lower 
odds (-0.79, -1.08, and -0.99 log odds respectively) of 
missing cases for the survey questions with the lowest 
cognitive demand (item nonresponse I) than those with only
a high school education or less (Table 8, Column 2). 
Surprisingly, item nonresponse related to the survey 
questions that were comparatively higher in cognitive 
demand (that is, item nonresponse II and item nonresponse 
III) did not differ by education (Table 8, Columns 4 and 6). 
We also find positive relationships between age and all three 
measures of item nonresponse (Table 8, Columns 2, 4, and 
6), which is consistent with Singer et al. (2000). Equally
consistent with the earlier work of Singer et al. (2000), the 
analysis reveals that female respondents are more likely to 
have missing data points than male respondents (Table 8, 
Column 4). However, the effect of gender on item 
nonresponse in our study is limited to those survey 
questions with a medium level of cognitive demand (the 
item nonresponse II variable). 


7. Discussion and conclusions 


In this study, we examined factors related to both unit 
and item nonresponse in survey research, focusing on 
interest in a survey’s topic and prepaid incentives. The 
obvious reason for carrying out this analysis is the fact that 
nonresponse (unit or item) represents a major challenge to 
survey research given its potential for generating non- 
sampling errors in parameter estimates (Brehm 1993; 
Dillman et al. 2002; Groves and Couper 1998). As
previously noted, nonresponse can lead to biased point 
estimators, variance inflation for point estimators, and biases 
in estimators of precision (Dillman et al. 2002; Groves and
Couper 1998). Therefore, our primary goal is to provide
information that will help researchers understand and deal 
appropriately with nonresponse, that is, minimize unit 
nonresponse and correctly understand and handle missing 
cases (item nonresponse). 

Our analysis reveals that the likelihood of participation in 
this survey on agriculture and the environment varies 
significantly by sampled units’ proximity to the agricultural 
and rural landscape (residential location). Our analysis is 
consistent with our first hypothesis and the theoretical 
proposition of leverage-salience, as we find that residents of 
exurban townships and rural places are all significantly 
more likely to participate in the survey than residents of core 
urban places. The pattern of relationships revealed in this 
analysis is most likely explained by the fact that respondents
residing in exurban townships and rural places have a higher 
chance of interacting with the agricultural and rural 
landscape than those residing in core urban places (see Table 
2). Thus, we suggest that respondents residing closer to the 
agricultural and rural landscape participated at higher rates in 
the survey due to the positive leverage of the survey’s focus 
on the agricultural and environmental domain. 

We also find some relationship between interest in the 
survey’s topic (measured by proximity to the agricultural 
and rural landscape) and response quality (measured by item 
nonresponse). In support of our second hypothesis, modest 
evidence in this study suggests that item nonresponse varies 
by proximity to the agricultural and rural landscape. For 
item nonresponse I, the data suggest that residents of 
exurban township areas are less likely to have missing data 
points than residents of core urban places, whereas residents 
of both exurban townships and rural places are more likely 
to have missing data points for item nonresponse II. Missing 
cases associated with questions with the highest cognitive 
demand (item nonresponse III) did not vary by residential 
location (interest in the survey’s topic). These findings 
suggest that residents of the more rural places (exurban 
townships and rural places) fare worse than those of core 
urban places when missing cases involve survey questions 
with a moderate level of cognitive demand. Although this 
result is intriguing, we are unable to explain why it is the 
case. One possible argument would be the educational 
difference between residents of core urban and rural places, 
but this study statistically controls for the effects of 
education. Further work certainly needs to be done on this 
subject. 

Knowledge of how food is produced, another indicator of 
proximity to agriculture and the rural landscape, is 
negatively related to item nonresponse, which is consistent 
with our expectation (hypothesis 3) and the leverage- 
saliency theory. As the knowledge of how food is produced 
is related to the broader topic of the survey, we believe that 
making the survey’s focus on agriculture and the 
environment salient in our request for participation in the 
survey may have generated higher diligence in questionnaire 
completion among respondents who knew or cared enough 
to know how food is produced. However, our analysis also 
suggests that support for animal welfare is positively related 
to item nonresponse, which is inconsistent with hypothesis 
3. These findings highlight the need to look closely at 
factors related to a survey’s topic as potential covariates of 
item nonresponse and its corollary, nonresponse error. 

Although the survey used in this study focused on 
agriculture and the environment, our findings in relation to 
the survey’s topic may have implications for surveys that 
focus on other sectors. There is reason to believe that unit 
and item nonresponse can be affected by respondents’
proximity to or level of interest in any survey topic or 
industry of focus, especially if this aspect of the survey is 
made salient during the request for participation. For 
example, if a survey focuses on the automotive industry and 
this feature is made salient during the request for 
participation, it is very likely that this information will affect 
the response pattern. In essence, these findings suggest that 
researchers designing surveys need to think critically about 
how the survey’s subject context, such as the industry or 
sector on which it focuses, might affect participation from 
subpopulations within the sample list. While this gener- 
alization may be reasonable, we believe similar studies 
focusing on other sectors will be required before we can 
draw firm conclusions. 

We next discuss the relationship between prepaid 
incentives on the one hand and survey participation and item 
nonresponse on the other. With respect to the relationship 
between incentives and response, our study suggests that 
prepaid incentives generally increase the likelihood of a 
respondent participating in a survey, even if proximity to 
agriculture and the rural landscape (the survey subject 
context) is taken into account. Our findings are consistent 
with hypothesis four and the previous literature (Singer 
et al. 2000; Groves 2006; Church 1993; Trussell and
Lavrakas 2004; Goyder 1982; and Yu and Cooper 1983), as 
they show that recipients of prepaid incentives were 
significantly more likely to participate in the survey than 
non-recipients, controlling for other variables in the logistic 
regression model. The analysis demonstrates that elimi- 
nating incentives altogether hurts the likelihood of 
participation regardless of respondents’ residential context. 
While we may not have overtly identified prepaid incentives 
with the leverage-saliency theory of Groves et al. (2000) in
the earlier sections of our discussion for the sake of 
analytical convenience, our findings in relation to this 
variable also provide further empirical support for this 
theory. Our findings clearly suggest that token financial 
incentives enclosed with each survey package helped 
increase participation from both metropolitan and non- 
metropolitan areas of Ohio, although this effect was higher 
in the former. This result provides fresh justification for the 
widespread use of incentives to bolster response rates. As 
indicated earlier in this paper, the widespread use of prepaid 
incentives in surveys makes it necessary to periodically 
assess the utility of this practice. Our finding also suggests 
the need to check for potential response bias if incentives 
are provided to only a section of the sampled respondents, 
such as when prepaid incentives are targeted at those 
assessed as being less likely to participate. 

In terms of the relationship between incentives and item 
nonresponse, we find no significant variation in missing 
data points between respondents who received monetary
incentives and those who did not, contrary to our fifth 
hypothesis. This finding, which controls for the effects of 
residential location (proximity to the agricultural and rural 
landscape) and other pertinent variables, is consistent with 
the earlier work of Davern et al. (2003), who failed to find 
any relationship between incentives and the number of 
imputations for missing data points. Thus, while the use of 
monetary incentives correlates significantly with unit 
nonresponse (outright nonparticipation in a survey), we find 
no relationship between incentives and item nonresponse 
(failure to respond to some questions on a questionnaire). 
In other words, providing incentives to a respondent does not
necessarily lead to greater diligence in survey completion.

The analysis revealed some interesting results with
respect to the relationship between some of the control 
variables and item nonresponse. While education, age and 
gender were used in this study primarily as control 
variables, the fact that they were found to be significantly 
related to item nonresponse raises practical concerns about 
handling missing cases in survey data. Before choosing 
between the various techniques for handling missing cases 
(see Fuchs and Kenett 2007), analysts will need to check for 
potential nonresponse bias resulting from the effects of these 
variables, especially if they will be part of an analysis. 


Evaluating within household selection rules under a multi-stage design


Tom Krenzke, Lin Li and Keith Rust ¹


Abstract 


The 2003 National Assessment of Adult Literacy (NAAL) and the international Adult Literacy and Lifeskills (ALL) surveys 
each involved stratified multi-stage area sample designs. During the last stage, a household roster was constructed, the 
eligibility status of each individual was determined, and the selection procedure was invoked to randomly select one or two 
eligible persons within the household. The objective of this paper is to evaluate the within-household selection rules under a 
multi-stage design while improving the procedure in future literacy surveys. The analysis is based on the current US 
household size distribution and intracluster correlation coefficients using the adult literacy data. In our evaluation, several 
feasible household selection rules are studied, considering effects from clustering, differential sampling rates, cost per 
interview, and household burden. In doing so, an evaluation of within-household sampling under a two-stage design is 
extended to a four-stage design and some generalizations are made to multi-stage samples with different cost ratios. 


Key Words: Intracluster correlation; Design effects; Multi-stage sampling. 


1. Introduction 


The 2003 National Assessment of Adult Literacy 
(NAAL), conducted for the National Center for Education 
Statistics, provided an indicator of the nation’s progress in 
English literacy for researchers, practitioners, policymakers, 
and the general public. As in the 1992 National Adult 
Literacy Study (NALS), adults were assessed in households 
in prose, document and quantitative literacy. The booklet 
designs were based on the 1992 NALS to allow for the 
measurement of trends between 1992 and 2003. 

In order to reduce the cost of interviewers traveling to 
households, the NAAL involved a stratified four-stage 
cluster design that resulted in 18,500 completed assessments 
administered to adults age 16 and older. In the NAAL, 
counties were grouped to form Primary Sampling Units 
(PSUs), which were stratified and selected in the first stage. 
In the second stage, Secondary Sampling Units (SSUs) were 
formed and selected within the sampled PSUs. The SSUs 
were individual census blocks, or groups of adjacent blocks 
with at least 60 households (HHs) formed within tract 
boundaries. Subsequently, households were selected within 
SSUs, and one sample person (1 SP) was randomly selected 
for household sizes up to 3 (B ≤ 3), and two persons (2 SPs)
were selected for household sizes greater than 3 (B > 3),
where B denotes the number of eligible persons per house- 
hold. This rule followed the within-household sampling 
approach used in the first cycle of NAAL (NCES 2001), 
conducted in 1992. An evaluation of the selection rule was 


conducted using the current US household size distribution 
and intraclass correlation coefficients computed from the 
2003 survey. In doing so, an evaluation of within-household 
sampling under a two-stage design (Clark and Steel 2007) is 
extended to a four-stage design, as used in the NAAL
survey, and some generalizations are made to multi-stage
samples with different cost ratios. 
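
The within-household rule described above can be sketched in a few lines. This is an illustration only, assuming equal-probability selection without replacement among the eligible persons on the roster; the function names and the roster representation are hypothetical and are not taken from the NAAL or ALL systems.

```python
import random

def sample_size_for_household(num_eligible: int) -> int:
    """Return b, the number of persons to select, given B eligible persons."""
    if num_eligible <= 0:
        return 0
    # Rule described in the text: one person if B <= 3, two persons if B > 3.
    return 1 if num_eligible <= 3 else 2

def select_sample_persons(roster, rng=None):
    """Randomly select b eligible persons from the household roster."""
    rng = rng or random.Random()
    b = sample_size_for_household(len(roster))
    return rng.sample(list(roster), b)

def within_household_probability(num_eligible: int) -> float:
    """Conditional selection probability b / B for each eligible person."""
    if num_eligible <= 0:
        return 0.0
    return sample_size_for_household(num_eligible) / num_eligible

for B in range(1, 7):
    print(B, sample_size_for_household(B), round(within_household_probability(B), 3))
```

The printed probabilities make the differential sampling rates visible: a person in a one-person household is selected with certainty, whereas a person in a three-person household has a conditional selection probability of one third.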

The data used for the evaluation include literacy mea- 
sures from three scales derived from three types of literacy - 
prose, document, and quantitative. For more information 
about the NAAL types of literacy, refer to
http://nces.ed.gov/NAAL/fr_tasks.asp. Two types of estimates are used:
averages (e.g., average prose literacy score) and percentage 
of adults at some level of literacy (e.g., percentage Below 
Basic prose literacy). For a discussion of the literacy levels 
used in NAAL, see http://nces.ed.gov/NAAL/perf_levels.asp. 
In addition to the NAAL data, the evaluation also uses US 
sample data from the international Adult Literacy and 
Lifeskills (ALL), which was conducted by Statistics 
Canada. The US sample in 2003, sponsored by NCES, was 
part of a comparative study that measured the skills of adults 
in several countries. Similar to the NAAL, the ALL was a 
multi-stage clustered sample survey and measured prose and 
document literacy, as well as numeracy (OECD 2005). The 
NAAL sample was much larger (18,500 completes) than the 
ALL sample (3,400 completes), and the target population 
for NAAL included ages 16+ while the target population for 
ALL included 16 to 65 year olds. Table 1 provides a
summary of each survey’s design and structure. 
Table 1
Features of the NAAL and ALL surveys

Survey  Area sample    Completes  Data collection  Assessments   Ages   Within-HH sampling rule
NAAL    PSUs, SSUs,    18,500     Screener         Prose         16+    B ≤ 3, b = 1
        households,               Interview        Document             B > 3, b = 2
        persons                   Assessment       Quantitative
ALL     PSUs, SSUs,    3,400      Screener         Prose         16-65  B ≤ 3, b = 1
        households,               Interview        Document             B > 3, b = 2
        persons                   Assessment       Numeracy

Note: PSU = Primary Sampling Unit, SSU = Secondary Sampling Unit, b = sample size, B = household size.


A discussion of the design considerations that helped 
form the evaluation of the within-household sampling rules 
is provided in Section 2. Section 3 discusses the compu- 
tation of intra-household correlations under multi-stage 
sample designs and focuses on incorporating the clustering 
impact from the initial stages of sample selection when 
deciding on a within-household selection rule. An eval- 
uation of selection rules was conducted using data from the 
in-person adult literacy surveys and the results are provided 
in Section 4. Finally, a brief summary is given in Section 5. 


2. Design considerations 


There are a number of factors that need to be considered 
when evaluating the within-household selection rules for
surveys such as NAAL and ALL. The remainder of this 
section will discuss the impact of the following factors on 
within-household sampling: household burden, clustering 
persons within households, differential sampling rates, 
multi-stage sampling, cost considerations, computerized 
systems, domains of interest and household composition. 

Household burden. For the adult literacy surveys, the 
interview and the assessment take about an hour and a half 
to administer in total. Therefore, one concern about 
selecting more than one person per household is the increase 
of burden to the household and the impact on response rates. 
However, there is no significant difference (0.05 signif- 
icance level) in the refusal rates between 1- and 2-SP 
households in ALL and NAAL as shown in Table 2. 

Clustering persons within households. Kish (1965)
discusses the benefits of a cluster sample relative to a simple
random sample. A cluster sample typically has a lower cost
per person; however, the unit variance is higher and it causes
greater complexities in statistical analysis. Kish introduced
the concept of a design effect (DEFF), which measures the 
increase in variance due to deviations from a simple random 
sample, such as clustering persons within households. Many 
surveys limit the selection to one sample person (SP) per 
household because of concerns over the increased clustering 
effect (i.e., increasing effect on variance estimates) associated
with multiple SPs per household. The DEFF due to clustering can be
expressed as: $\mathrm{DEFF}_{clu} = 1 + (\bar{b} - 1)\,\mathrm{Rho}$,
where $\bar{b} = \sum_B (M_B / M)\, b_B$, $M_B$ = number of households of
size $B$, $M$ = number of households, and $b_B$ = sample size
of persons within households of size $B$ (Kish 1965). This
DEFF component increases when the sample size within a 
household increases or when the value of the intracluster 
correlation (Rho) increases. As given in Cochran (1977), 


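Rho can be approximated as:

$$\mathrm{Rho} = 1 - \frac{s_w^2}{s^2},$$

where

$$s_w^2 = \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} (y_{ij} - \bar{y}_i)^2}{a(b-1)}$$

and

$$s^2 = \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} (y_{ij} - \bar{y})^2}{ab - 1},$$

where $a$ is the number of sampled households, and $b$ is the
number of sampled persons per household. The DEFF due
to clustering is examined further for different within-household
sampling rules in the next section.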

Differential sampling rates. A clustering effect is not the 
only factor that increases the variance. Increases in variance 
are also due to differential sampling rates (resulting in 
differential weights). Under a 1 SP per household strategy, 
the increase is directly related to the variation in household 
size since the sampling rate could vary from 1 out of 1 to 1
out of 7 or more. The DEFF due to differential sampling
rates is expressed as: $\mathrm{DEFF}_{wts} = \sum_B (p_B / k_B) \sum_B (p_B k_B)$,
where $p_B = N_B / N$, $N_B$ = number of eligible persons in the
population in households of size $B$, $N$ = number of eligible
persons in the population, and $k_B$ = sampling rate within
households of size $B$ (Kish 1965). Under certain
conditions, the overall DEFF can be expressed as the
product of the clustering and differential sampling rate
components: $\mathrm{DEFF} = \mathrm{DEFF}_{clu} \times \mathrm{DEFF}_{wts}$. Kalton, Brick
and Lê (2005) suggest this product is applicable when the
weights are random or approximately random.
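
The two DEFF components above can be illustrated with a minimal Python sketch. The household size distribution `HH_SHARE`, the value `rho=0.33`, and the function names are assumptions introduced for illustration only; they are not taken from the NAAL or ALL data. The sketch treats the largest size class as exactly four eligible persons.

```python
def person_shares(hh_share):
    """p_B = N_B / N: share of eligible persons living in households of size B."""
    persons = {B: share * B for B, share in hh_share.items()}
    total = sum(persons.values())
    return {B: count / total for B, count in persons.items()}


def deff_clustering(hh_share, b_by_size, rho):
    """DEFF_clu = 1 + (b_bar - 1) * Rho, with b_bar = sum_B (M_B / M) * b_B."""
    b_bar = sum(hh_share[B] * b_by_size[B] for B in hh_share)
    return 1.0 + (b_bar - 1.0) * rho


def deff_weights(hh_share, b_by_size):
    """DEFF_wts = sum_B (p_B / k_B) * sum_B (p_B * k_B), with k_B = b_B / B."""
    p = person_shares(hh_share)
    k = {B: b_by_size[B] / B for B in hh_share}
    return sum(p[B] / k[B] for B in p) * sum(p[B] * k[B] for B in p)


# Hypothetical shares of households by number of eligible adults (M_B / M).
HH_SHARE = {1: 0.30, 2: 0.50, 3: 0.15, 4: 0.05}
TAKE1 = {1: 1, 2: 1, 3: 1, 4: 1}   # 1 SP regardless of household size
NAAL3 = {1: 1, 2: 1, 3: 1, 4: 2}   # 2 SPs once household size exceeds 3

for name, rule in (("Take1", TAKE1), ("NAAL3", NAAL3)):
    clu = deff_clustering(HH_SHARE, rule, rho=0.33)
    wts = deff_weights(HH_SHARE, rule)
    print(name, round(clu, 4), round(wts, 4), round(clu * wts, 4))
```

Under these assumed inputs, moving from one SP to two SPs in the largest households lowers the differential sampling rate component and raises the clustering component, which is the trade-off the text describes.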
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Table 2 

Refusal rates by 1- and 2-SP households for the adult literacy surveys
Survey  Subgroup          Refusal rate %
NAAL    1-SP households   16.3
        2-SP households   [illegible]
ALL     1-SP households   17.6
        2-SP households   16.2
Note: SP = sample person.


To arrive at a self-weighting sample, persons within
households would need to be selected at a constant rate.
However, a rate-based approach is not preferred in most
surveys since it would result in walking away from a portion
of single-person households and, thus, would increase the
cost of the survey. We limit the alternative rules under
consideration to those with a minimum of 1 SP per household.
Out of concern for burdening households, the maximum
sample size was set to two. The sampling rules under
consideration are:


1. Take1: 1 SP no matter the household size.

2. Rule2: 1 SP for household sizes up to 2; otherwise 2
SPs are selected.

3. NAAL3: 1 SP for household sizes up to 3; otherwise
2 SPs are selected.

4. Rule4: 1 SP for household sizes up to 4; otherwise 2
SPs are selected.

5. Frac5: take at least 1 SP, but no more than 2 SPs, and
the sample size is a fraction. That is, if the sample
size for a household with two eligible persons is 1.6,
then two persons are selected 60 percent of the time
at random, and one person is selected 40 percent of
the time.


While the Take1 approach does not attempt to reduce the
DEFF due to differential sampling rates, it is not subject to a
clustering impact. However, the other four approaches listed
above provide a reduction in the differential sampling rate
component while introducing a clustering effect. In the case
of Frac5, under the assumption that π-weights are used, as
assumed throughout this paper, the approach would result in
the most reduction in the differential sampling rate
component. The π-weights approach is based on the
unconditional selection probability of the person within the
household. If the actual sample size within a household is
used in the form of ratio weights, the differential sampling
rate component increases, and the benefit is less clear and
depends on Rho. Figure 1 illustrates the best options under a
two-stage household design with fixed effective sample size
of persons, without any cost considerations. The US national
household size distribution from the 2007 Current Population
Survey was used for this illustration. As shown in Figure 1, the
fractional approach is the best rule for a wide range of
values of Rho. The fractional approach can be programmed
into a computerized system when enumerating and selecting
household members (more discussion on computerized
systems follows). If computerized systems are not available
for screening, then the best approach for low values of Rho
is the more clustered approach, Rule2; and the NAAL3 rule
is best for Rho values greater than about 0.34.

Multi-stage sampling. For multi-stage area designs, the 
clustering impact of sampling within households is affected 
by the clustering due to PSUs and SSUs. As pointed out by 
Kish (1965), the clustering of households and persons 
within PSUs and SSUs increases the sampling variance (i.e., 
units within PSUs and SSUs are more similar to each other). 
The incremental impact of clustering within households 
may be dampened by the domination of the PSU and SSU 
variance components (however, the magnitude of the impact 
will differ depending on the type of estimate and variable). 
That is, more persons within a household can be selected for 
surveys with a large amount of clustering due to the first 
two stages of sampling. Details of this distinction are 
provided in Section 3. 

Cost considerations. The cost of screening a household in 
a 1 SP per household design versus the cost of interviewing/ 
assessing a second person in a household is investigated in an 
extensive analysis presented later.

Computerized systems. Computerized systems, such as 
Computer-Assisted Personal Interview (CAPI), have the 
capability of handling fractional sample sizes. That is, the 
random selection of 1 or 2 SPs given a pre-assigned 
fractional sample size can be programmed. Computerized 
systems also have the capability of sorting the list of eligible 
persons and selecting 2 SPs with a systematic random 
sample. Another benefit is that the selection program can be 
tested and validated prior to data collection. 
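
A minimal Python sketch of how a pre-assigned fractional sample size might be turned into a random selection of 1 or 2 SPs is given below. The function name `select_sample_persons` and its arguments are hypothetical, and the sketch uses simple random selection of persons rather than the sorted systematic selection mentioned above; it is not the CAPI program used in NAAL or ALL.

```python
import random


def select_sample_persons(eligible, expected_size, rng=random):
    """Select 1 or 2 persons so that the expected number selected is expected_size.

    eligible      -- list of eligible household members
    expected_size -- pre-assigned fractional sample size between 1 and 2
    rng           -- the random module or a random.Random instance
    """
    if not 1.0 <= expected_size <= 2.0:
        raise ValueError("expected_size must lie between 1 and 2")
    if not eligible:
        return []
    # Take a second person with probability equal to the fractional part.
    take_two = len(eligible) >= 2 and rng.random() < expected_size - 1.0
    return rng.sample(eligible, 2 if take_two else 1)


# With a sample size of 1.6, two persons are taken about 60 percent of the time.
rng = random.Random(2010)
draws = [len(select_sample_persons(["A", "B"], 1.6, rng)) for _ in range(20000)]
print(sum(draws) / len(draws))
```

Passing a seeded `random.Random` instance makes the selection reproducible, which is one way the program could be tested and validated before data collection.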

Domains of interest. As mentioned earlier, optimal
within-household sampling depends on the magnitude of
the clustering effect associated with the variable of interest.
The clustering effect may be much smaller when the
variable is associated with a subgroup of the population,
rather than the entire population. For example, when a key
reporting domain is gender in a survey of the adult
population, the reporting category of males is likely to have
an average of 1 SP per household and less likely to have
2 male SPs, which would introduce a clustering effect.
Therefore, when there are multiple domains of interest in a
typical household, it is often beneficial to select more than 1
SP within a household. Refer to Mohadjer and Curtin
(2008) for an example of design considerations for a survey
with focus on multiple subgroups of the population.

[Figure 1 legend: Take1, Rule2, NAAL3, Rule4, Frac5]

Figure 1 Initial analysis of within-household selection rules


Household composition. Lastly, one may want to
consider the household composition and relationships of
persons within a household when devising the selection
rule. Table 3 displays values of Rho for various relationships
between household members, for households with 2
SPs in the NAAL survey. Rho varies greatly by household
member relationships. The relationships were derived from
gender and age.


3. Estimation of intra-household Rho and DEFF 
under multi-stage sampling 


The discussion about Rho thus far has been related to a 
two-stage design, but both NAAL and ALL have four stages 
of sampling. The total variance can be decomposed into four 
between-variance terms attributable to PSUs, SSUs, house- 
holds and persons, as follows: 


$$\sigma_T^2 = \sigma_{PSU}^2 + \sigma_{SSU(PSU)}^2 + \sigma_{HH(SSU)}^2 + \sigma_{PERS(HH)}^2.$$


As shown below, when applying a two-stage approach to
estimate Rho for a four-stage sample design, the numerator
not only contains the between-household component, but
also contains contributions from the between-PSU and
between-SSU components, inflating the values of Rho for
our purpose:

$$\mathrm{Rho} = \frac{\sigma_{PSU}^2 + \sigma_{SSU(PSU)}^2 + \sigma_{HH(SSU)}^2}{\sigma_T^2}.$$

Therefore, when evaluating rules for within-household
sampling under a multi-stage design, we assume the PSU
and SSU design will be the same in the future. This can be
accomplished by limiting our focus to within-SSU sampling.
The computation of Rho is then contained within
SSUs, that is, it is done in a compact manner without effect
from the PSU and SSU components. We refer to this as the
compact (i.e., within-SSU) Rho, denoted by Rho*, expressed
as:

$$\mathrm{Rho}^* = \frac{\sigma_{HH(SSU)}^2}{\sigma_{HH(SSU)}^2 + \sigma_{PERS(HH)}^2}.$$

Using the compact Rho*, we now derive the estimated
DEFF under a multi-stage sample design for the purpose of
determining optimal within-household sample sizes. The
variance of an estimate ($\hat{\theta}$) with $b$ persons per household
can be decomposed as:

$$\mathrm{Var}(\hat{\theta}) = \frac{\sigma_{PSU}^2}{n_{PSU}} + \frac{\sigma_{SSU(PSU)}^2}{n_{SSU}} + \frac{\sigma_{HH(SSU)}^2}{n_{HH}} + \frac{\sigma_{PERS(HH)}^2}{b\,n_{HH}},$$

where $n_{PSU}$, $n_{SSU}$, $n_{HH}$ and $b\,n_{HH}$ are the sample sizes of
PSUs, SSUs, households and persons, respectively.
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Table 3 
Rho for NAAL assessment scores by household member relationships 
Estimate                             Siblings  Child-guardian  Married  Others
Number of households with 2 SPs      111       205             180      434
Average prose score                  0.42      0.35            0.70     0.59
Average document score               0.40      0.27            0.72     0.54
Average quantitative score           0.46      0.36            0.63     0.56
Percentage Below Basic prose         0.52      0.41            0.79     0.67
Percentage Below Basic document      0.54      0.40            0.78     0.60
Percentage Below Basic quantitative  0.51      0.41            0.77     0.65

Then the DEFF due to clustering, relative to taking one
person per household and $b\,n_{HH}$ households, is:

$$\mathrm{DEFF}_{clu}^{**} = \frac{\dfrac{\sigma_{PSU}^2}{n_{PSU}} + \dfrac{\sigma_{SSU(PSU)}^2}{n_{SSU}} + \dfrac{\sigma_{HH(SSU)}^2}{n_{HH}} + \dfrac{\sigma_{PERS(HH)}^2}{b\,n_{HH}}}{\dfrac{\sigma_{PSU}^2}{n_{PSU}} + \dfrac{\sigma_{SSU(PSU)}^2}{n_{SSU}} + \dfrac{\sigma_{HH(SSU)}^2}{b\,n_{HH}} + \dfrac{\sigma_{PERS(HH)}^2}{b\,n_{HH}}}$$

$$= \frac{b\,n_{HH}\left(\dfrac{\sigma_{PSU}^2}{n_{PSU}} + \dfrac{\sigma_{SSU(PSU)}^2}{n_{SSU}}\right) + b\,\sigma_{HH(SSU)}^2 + \sigma_{PERS(HH)}^2}{b\,n_{HH}\left(\dfrac{\sigma_{PSU}^2}{n_{PSU}} + \dfrac{\sigma_{SSU(PSU)}^2}{n_{SSU}}\right) + \sigma_{HH(SSU)}^2 + \sigma_{PERS(HH)}^2} = \frac{k^* + (1 + (b-1)\,\mathrm{Rho}^*)}{k^* + 1},$$

where

$$k^* = \frac{b\,n_{HH}\left(\dfrac{\sigma_{PSU}^2}{n_{PSU}} + \dfrac{\sigma_{SSU(PSU)}^2}{n_{SSU}}\right)}{\sigma_{HH(SSU)}^2 + \sigma_{PERS(HH)}^2}.$$

Alternatively, $\mathrm{DEFF}_{clu}^{**}$ can be expressed as:

$$\mathrm{DEFF}_{clu}^{**} = 1 + (b-1)\,\mathrm{Rho}^{**},$$

where

$$\mathrm{Rho}^{**} = \frac{\mathrm{Rho}^*}{k^* + 1} = \frac{\sigma_{HH(SSU)}^2}{b\,n_{HH}\left(\dfrac{\sigma_{PSU}^2}{n_{PSU}} + \dfrac{\sigma_{SSU(PSU)}^2}{n_{SSU}}\right) + \sigma_{HH(SSU)}^2 + \sigma_{PERS(HH)}^2}.$$

The Rho** measure is a useful expression for the
intra-household correlation under a multi-stage design, which is
equal to Rho* when $\sigma_{PSU}^2 = \sigma_{SSU(PSU)}^2 = 0$. The compact
Rho* measure is useful for evaluating optimal sample sizes
while varying the variance ratio k*. Note, however, that in
general Rho** is a function of $n_{PSU}$, $n_{SSU}$ and the total
sample size of persons, whereas Rho* does not depend on
these.

As shown in Table 4, the variance ratio k*, which is the
variance from the first two stages divided by the variance
from the last two stages, for a one person per household
design, ranges from 0.68 to 1.61 across types of assessments
and estimates for the ALL survey.

Table 5 shows estimates for Rho (computed under a
two-stage design assumption), the compact Rho* and Rho**
(computed under a multi-stage design assumption where
k* = 1) for average NAAL and ALL literacy assessment
scores. When including the clustering impact from the first
two stages of the four-stage design, the values of the
compact Rho* and Rho** are much smaller than Rho. For
example, the two-stage Rho for the NAAL average prose
score is 0.57, the compact Rho* is equal to 0.33 and
Rho** is equal to 0.17. The table also shows that values of
the compact Rho* for average scores are at about the same
level for NAAL (range from 0.32 to 0.33) and ALL (range
from 0.29 to 0.39). There is some variation by the type of
estimate as well: values of Rho* for ALL are 0 to 0.2 lower
for the percentage in Level 1 or 2 than for the average
scores. Values of Rho* can also vary by household size as
shown in Figure 2 in Appendix A.
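
The relationships among the compact Rho*, Rho** and the multi-stage clustering DEFF in this section can be checked numerically with a short Python sketch. The function names are assumptions made for illustration; the only inputs taken from the text are the NAAL average prose values (Rho* = 0.33 with k* = 1).

```python
def compact_rho(var_hh, var_pers):
    """Rho* = sigma^2_HH(SSU) / (sigma^2_HH(SSU) + sigma^2_PERS(HH))."""
    return var_hh / (var_hh + var_pers)


def rho_double_star(rho_star, k_star):
    """Rho** = Rho* / (k* + 1); it equals Rho* when k* = 0."""
    return rho_star / (k_star + 1.0)


def deff_clu_multistage(b, rho_star, k_star):
    """(k* + 1 + (b - 1) * Rho*) / (k* + 1), equivalently 1 + (b - 1) * Rho**."""
    return (k_star + 1.0 + (b - 1.0) * rho_star) / (k_star + 1.0)


# NAAL average prose score: Rho* = 0.33 and k* = 1 give Rho** = 0.165,
# which is reported as 0.17 after rounding.
print(rho_double_star(0.33, 1.0))
print(deff_clu_multistage(2, 0.33, 1.0))
```

The second function makes explicit why a larger share of variance from the PSU and SSU stages (a larger k*) dampens the incremental clustering effect of taking a second person in a household.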


4. Evaluation and results 


We compared the current sampling rules with optimal
sampling rules by minimizing a variance-cost (VC)
function, which is the product of the DEFFs (i.e., variance
increase) due to clustering and weighting, and a cost
function that is used by Kish (1965):

$$VC = \mathrm{DEFF}_{clu} \times \mathrm{DEFF}_{wts} \times n\left(c_p + \frac{c_{HH}}{\bar{b}}\right),$$

where $c_p$ = cost per added person and $c_{HH}$ = cost per added
household. Note that $n/\bar{b}$ represents the number of
sampled households. To account for the differential
clustering effects for each household size $B$, we replace
$\mathrm{DEFF}_{clu}$ with:

$$\mathrm{DEFF}_{clu}^{**} = \frac{k^* + \sum_B p_B\,(1 + (b_B - 1)\,\mathrm{Rho}^*_B)}{k^* + 1},$$

where $\mathrm{Rho}^*_B$ is computed as described in Appendix A.

Note that the VC function represents the additional cost
of increasing the overall sample size to offset the increase in
variance due to the DEFF components. Table 6 provides the
results for optimal integer solutions as computed by a
computational algorithm which is described in Appendix B.
The table shows that as the cost ratio increases from 0.5 to 1
for k* = 1, we would want to take more persons per
household, that is, 2 out of 2 instead of 1 out of 2. As the
variance ratio goes from 1 to 3 for optimal integer solutions,
the only change is for household size of 2 and cost ratio of
0.5. That is, when the variance ratio is equal to 3, it is
beneficial to take 2 out of 2 instead of 1 out of 2.

Table 6 also gives the results when fractional sample
sizes are allowed. The variance and cost ratios for NAAL
and ALL tend to be about 1, where it appears that selecting
1 out of 1, 1.6 out of 2, and 2 otherwise is the best rule. The
effects of cost and variance ratios are clearer under the
fractional sample sizes when compared to the integer
solutions.

If the cost of conducting a screener is small in relation to
the cost of interviewing, then variances can be reduced
using the fractional walk-away approach. Table 6 shows
optimal walk-away sample sizes. Under this approach, for
example, a sample size of 0.9 indicates that we walk away
from 10 percent of the households where B = 1. If the cost
of screening is a very small portion of the cost of
interviewing, then the optimal design may involve walking
away from many more households.

Under the likely NAAL/ALL parameters for cost ratios
($c_{HH}/c_p = 1$) and variance ratios (k* = 1), when
compared to the Take1 approach, the VC function can be
reduced by about 9 percent by using the NAAL/ALL
sampling rule, 19 percent by using the optimal integer
solution, 20.4 percent using the optimal fractional solution,
and 20.6 percent using the optimal walk-away approach. In
general, the gains from deviating from the Take1 approach
grow as the cost per additional household (i.e., screening)
increases. The average cluster sizes for each approach are
given in Table 7. For the NAAL and optimal integer rule, the
average cluster size indicates the percentage of households
with 2 SPs. For example, about 6 percent of the households
would have 2 SPs under the NAAL3 strategy.


Table 4
Values of k* for the ALL sample

ALL estimate
Average prose score
Average document score
Average quantitative/numeracy score
Percentage in Level 1 or 2 prose
Percentage in Level 1 or 2 document
Percentage in Level 1 or 2 numeracy
[k* values not recoverable; the text reports a range of 0.68 to 1.61 across these estimates]


Table 5
Values for Rho, Rho*, and Rho** for literacy assessment scores

                                                                            Rho          Rho*         Rho**
Estimate                                                                 NAAL  ALL    NAAL  ALL    NAAL  ALL
Number of households with 2 SPs                                          930   162    930   162    930   162
Average prose score                                                      0.57  0.60   0.33  0.38   0.17  0.19
Average document score                                                   0.53  0.50   0.33  0.29   0.17  0.15
Average quantitative/numeracy score                                      0.54  0.58   0.32  0.39   0.16  0.20
Percentage Below Basic (NAAL)/Level 1 or 2 (ALL) prose                   0.65  0.44   0.42  0.28   0.21  0.14
Percentage Below Basic (NAAL)/Level 1 or 2 (ALL) document                0.61  0.37   0.39  0.28   0.20  0.14
Percentage Below Basic quantitative (NAAL)/Level 1 or 2 (ALL) numeracy   0.62  0.36   0.40  0.17   0.20  0.09
Note: Rho** is computed assuming k* = 1.
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Table 6 


Optimal expected number of persons per household by type of person sampling method and household size (B)

Person sampling method
               Integer                 Fractional              Walk-away
k*  c_HH/c_p   B=1  B=2  B=3  B=4     B=1  B=2  B=3  B=4     B=1  B=2  B=3  B=4
1   0.5        1    1    2    2       1    1.4  2    2       0.6  1.3  2    2
1   1          1    2    2    2       1    1.6  2    2       0.9  1.6  2    2
1   2          1    2    2    2       1    [?]  2    2       1    [?]  2    2
3   0.5        1    2    2    2       1    1.6  2    2       0.8  [?]  2    2
3   1          1    2    2    2       1    1.8  2    2       1    1.8  2    2
3   2          1    2    2    2       1    2    2    2       1    2    2    2
Note: [?] marks an entry that is illegible in the source.

Table 7
Percent reduction of NAAL3 and optimal solutions from Take1 strategy and average cluster sizes

               Percentage reduction from Take1 strategy     Average cluster sizes
k*  c_HH/c_p   NAAL3  Integer  Fractional  Walk-away        NAAL3  Integer  Fractional  Walk-away
1   0.5        8.2    13.0     15.8        18.0             1.06   1.18     1.38        [?]
1   1          9.1    19.2     20.4        20.6             1.06   1.68     1.48        1.45
1   2          9.9    26.1     26.1        26.1             1.06   1.68     1.63        1.63
3   0.5        8.6    [?]      18.7        19.0             1.06   1.68     1.48        [?]
3   1          9.5    [?]      [?]         23.9             1.06   1.68     1.58        1.58
3   2          10.4   30.2     30.2        30.2             1.06   1.68     1.68        1.68
Note: [?] marks an entry that is illegible in the source.


Lastly, a sensitivity analysis was conducted by varying
the values of Rho*. A regression model was fit on the
percentage reduction from the Take1 strategy of the VC
function, with the independent variables being the approach
(NAAL3, integer, fractional, walk-away), cost ratio (0.1,
0.5, 1, 2, 10), variance ratio (1, 3, 5) and Rho* (+/- 0.1). For
the range of data, Rho* had a limited impact (parameter
estimate -7.4 with an associated standard error of 4.5) on the
percentage reduction of the VC function, while the other
factors had more of an impact.


5. Summary 


Several design considerations were taken into account
when evaluating the within-household selection rule for the
NAAL and ALL surveys, including clustering effects from
the initial stages of sampling. To
facilitate the evaluation, we formulate a way to incorporate
PSU and SSU variance contributions into the computation
of the DEFF due to clustering and the intra-household
correlation when deciding how many persons and how
many households to select in a multi-stage sample design. In
doing so, we introduce the compact Rho* measure, which is
computed within the SSU so it is not impacted by the PSU
and SSU variance components. This is useful when
determining the DEFF due to clustering within households,
while varying the contribution to the total variance from the
PSU and SSU stages of selection in multi-stage sample
designs. The measure Rho** is introduced as an expression
for the intra-household correlation under a multi-stage
design, taking into consideration the contribution to total
variance from the first two stages of selection.

In addition, a computational algorithm was developed to 
compute optimal sample size solutions, incorporating the 
DEFFs due to clustering, differential sampling rates, and 
costs. 

In general, the main factors affecting the percentage reduction
of the VC function from the Take1 approach are the level of
dominance from the PSU and SSU variance components in
multi-stage sampling, the cost ratio and the rule used. For
the range of data evaluated, Rho* had limited impact on the
reduction in VC from the Take1 approach. In general, the
NAAL rule improves on the widely-used Take1 approach.
The optimal integer rule improves on the NAAL rule.
However, the optimal fractional rule has limited gains over
the optimal integer rule. The optimal walk-away rule has
gains over the other rules for lower cost ratios. Lastly, when
the first two variance components dominate and the cost ratio is
high, then the integer, fractional and walk-away rules are
essentially the same.
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Appendix A 


Estimates of Rho by household size 


Survey estimates are not attainable for Rho* by household
size since only 1 SP was selected for household sizes of
3 or less and since the sample size was too small to create
estimates for each household size of 4 or more. Therefore,
estimates of Rho* by household size are modeled using
Census data. Figure 2 shows Rho on the y-axis and
household size on the x-axis. The upper line is from the US
Census public-use microdata sample (PUMS) file for
educational attainment for ages 25+. The upper line shows
that educational attainment is more similar among households
with two adults, perhaps more likely to be married couples.
It shows a drop off when going from two to three adults. We
captured the variation in household size by computing the
ratio of Rho* for the NAAL prose literacy scores to the Rho
for the Census PUMS educational attainment among
households with B > 3 and applying the ratio to the PUMS
Rho across all household sizes. The resulting values are the
estimates of the compact Rho* by household size.
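
The ratio adjustment described above can be sketched in a few lines of Python. All numeric inputs below are hypothetical placeholders (the PUMS values are not reported in the text), and the function name and the use of a single reference PUMS Rho for households with B > 3 are assumptions made for illustration.

```python
def scale_rho_by_size(pums_rho_by_size, survey_rho_star, pums_rho_reference):
    """Apply the ratio (survey Rho* / reference PUMS Rho) to PUMS Rho at each size."""
    ratio = survey_rho_star / pums_rho_reference
    return {B: ratio * rho for B, rho in pums_rho_by_size.items()}


# Hypothetical PUMS Rho values by household size, and a hypothetical
# reference PUMS Rho for households with more than three eligible persons.
PUMS_RHO = {2: 0.60, 3: 0.40, 4: 0.50}
modeled = scale_rho_by_size(PUMS_RHO, survey_rho_star=0.33, pums_rho_reference=0.50)
print(modeled)
```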


Appendix B 


Computational algorithm 


A computational algorithm was developed to arrive at 
optimal within-household sample sizes for each household 
size B. The algorithm was constructed to generate optimal 
integer or fractional solutions that capture the effects of 
clustering, differential sampling rates and cost, under the 
constraints of at least one selected person per household and 
no more than 2. Here are the steps of the algorithm (all 
processing runs converged within four iterations): 


— Initialize by setting b_B = 1 for all values of B (Take1).
— Compute DEFF_clu, DEFF_wts, C_p, C_HH, and VC(0).
— Do t = 1 to 5:
    — Do B = 1 to 11:
        — Compute DEFF_clu, DEFF_wts, C_p, C_HH, and
          VC for all 1 ≤ b_B ≤ 2, given the set of b_B'
          for all B' ≠ B.
        — Identify the b_B with the smallest value of VC.
    — End.
    — If VC(t) = VC(t − 1) then stop.
— End.
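
The steps above can be sketched in Python as a coordinate search. This is an illustrative sketch under stated assumptions, not the authors' program: the household distribution `HH_SHARE` and the size-specific `RHO_STAR` values are hypothetical, the cost per added person is fixed at 1 so that `cost_ratio` stands for c_HH/c_p, the size-specific clustering terms are weighted by the person shares p_B, and the search grid is capped at min(B, 2).

```python
def variance_cost(b, hh_share, rho_star, k_star, cost_ratio):
    """VC per sampled person for within-household sample sizes b[B].

    Assumes c_p = 1 (so cost_ratio = c_HH / c_p) and weights the
    size-specific clustering terms by the person shares p_B.
    """
    sizes = sorted(hh_share)
    persons = {B: hh_share[B] * B for B in sizes}
    total = sum(persons.values())
    p = {B: persons[B] / total for B in sizes}        # p_B = N_B / N
    k = {B: b[B] / B for B in sizes}                  # sampling rate k_B
    deff_wts = (sum(p[B] / k[B] for B in sizes)
                * sum(p[B] * k[B] for B in sizes))
    clustered = sum(p[B] * (1.0 + (b[B] - 1.0) * rho_star[B]) for B in sizes)
    deff_clu = (k_star + clustered) / (k_star + 1.0)
    b_bar = sum(hh_share[B] * b[B] for B in sizes)    # mean persons per household
    return deff_clu * deff_wts * (1.0 + cost_ratio / b_bar)


def optimize_rule(hh_share, rho_star, k_star, cost_ratio, step=1.0, max_iter=5):
    """Coordinate search over b_B in [1, min(B, 2)], starting from Take1."""
    n_steps = round(1.0 / step)
    grid = [round(1.0 + i * step, 10) for i in range(n_steps + 1)]
    args = (hh_share, rho_star, k_star, cost_ratio)
    b = {B: 1.0 for B in hh_share}                    # Take1 starting point
    vc_prev = variance_cost(b, *args)
    for _ in range(max_iter):
        for B in sorted(hh_share):
            candidates = [c for c in grid if c <= B]
            # Best b_B given the current values for all other household sizes.
            b[B] = min(candidates,
                       key=lambda c: variance_cost({**b, B: c}, *args))
        vc = variance_cost(b, *args)
        if vc == vc_prev:                             # no further improvement
            break
        vc_prev = vc
    return b, variance_cost(b, *args)


# Hypothetical inputs for illustration only.
HH_SHARE = {1: 0.30, 2: 0.50, 3: 0.15, 4: 0.05}
RHO_STAR = {1: 0.0, 2: 0.45, 3: 0.30, 4: 0.30}

print(optimize_rule(HH_SHARE, RHO_STAR, k_star=1.0, cost_ratio=1.0))            # integer
print(optimize_rule(HH_SHARE, RHO_STAR, k_star=1.0, cost_ratio=1.0, step=0.1))  # fractional
```

Setting `step=1.0` restricts the search to integer solutions, while a smaller step such as 0.1 allows fractional sample sizes. A walk-away variant would additionally allow values below 1 for single-person households, which this sketch does not cover.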


[Figure 2: Rho (y-axis) by household size (x-axis); series: NAAL Prose Indirect; Census PUMS, Educ 25+]


Figure 2 Estimates of Rho for NAAL by household size
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2010 International Methodology Symposium 
Statistics Canada 

October 26-29, 2010 

Ottawa, ON, Canada 


Social Statistics: The Interplay among Censuses, Surveys and 
Administrative Data 


Statistics Canada’s 2010 International Methodology Symposium will take place at the Crowne Plaza Hotel, 
located in the heart of downtown Ottawa, from October 26-29, 2010. 


The Symposium will be titled “Social Statistics: The Interplay among Censuses, Surveys and 
Administrative Data”. Members of the statistical community, such as those from private organizations, 
governments, or universities, are invited to attend, particularly if they have a special interest in statistical 
or methodological issues resulting from the use of multiple sources of data (censuses, sample surveys or 
administrative data). 


The first day will consist of workshops, while the following days will consist of both plenary and parallel 
sessions covering a variety of topics. Additional research and results may be presented via poster 
sessions. 


The presentations will be related to the methodological aspects of using multiple sources of data. Topics 
may include: 


• Sampling Frames and Sample Design
• Coordinating Samples
• Content and Questionnaire Design
• Data Collection Methods and Acquisition of Administrative Data
• Supplementing Survey Data with Administrative Data
• Administrative Data for Direct Estimation
• Statistical Databases from Administrative Data (e.g., Population Registers)
• Imputation
• Weighting and Estimation
• Dissemination and Data Access
• Record Linkage Techniques
• Record Linkage Software
• Measurement Errors
• Response Burden
• Treatment of Nonresponse
• Confidentiality, Privacy and Ethical Issues
• Small Area Estimation


Visit our Internet site regularly to obtain further details about the program, workshops, registration, 
accommodation, tourism information and more at 


http://www.statcan.gc.ca/conferences/symposium2010/index-eng.htm 
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FOURTH INTERNATIONAL 
CONFERENCE ON ESTABLISHMENT 
SURVEYS (ICES IV) PLANNED 

FOR 2012 


Planning is underway for the Fourth International Conference on Establishment Surveys (ICES IV). 
If you’ve attended any of the past conferences, you know how invaluable they have been to the 
literature and practice of establishment surveys. If you are newer to the establishment survey field, 
you will find the conference especially rewarding. Since the last ICES held in 2007, many new 
techniques have been developed by practitioners around the world. A major strength of the 
conferences is the strong international presence, both in the program development and attendance. 
Over 400 people from 94 countries attended ICES III. On June 11-14 2012, survey practitioners 
from government agencies, academia, private sector and more will gather at the Sheraton Centre 
Montreal in Quebec, Canada for ICES IV and continue the tradition of sharing innovative techniques 
and best practices to address common issues. 


Sponsorship of the meetings is being provided by the American Statistical Association, ASA Section 
on Survey Research Methods, ASA Section on Government Statistics, International Association of 
Survey Statisticians, and the Statistical Society of Canada. Administrative support for ICES IV will 
be provided by the American Statistical Association, similar to previous ICES meetings. Also, many 
other organizations and government agencies are or will be providing support for the conference. 


With the support of these many great organizations and the diverse gathering of individuals 
involved in establishment surveys, we anticipate that ICES IV will prove to be another fruitful 
conference in the valuable ICES series. So, save the date, June 11-14, 2012, and join 
practitioners from around the globe in Montreal, Canada! You can participate in the growing ICES 
IV program discussing current issues, future vision, and cutting-edge methods in surveying 
businesses, farms and institutions. Expect updates on participation and program details to ICES IV 
through this newsletter and the upcoming ICES IV website. Inquiries may be directed to 


ices4@amstat.org. 
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Before finalizing your text for submission, please examine a recent issue of Survey Methodology (Vol. 32, No. 2 and onward) 
as a guide and note particularly the points below. Articles must be submitted in machine-readable form, preferably in Word. 
A pdf or paper copy may be required for formulas and figures. 
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Layout 


Documents should be typed entirely double spaced with margins of at least 1 inches on all sides. 

The documents should be divided into numbered sections with suitable verbal titles. 

The name (fully spelled out) and address of each author should be given as a footnote on the first page of the 
manuscript. 

Acknowledgements should appear at the end of the text. 

Any appendix should be placed after the acknowledgements but before the list of references. 


Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid 
mathematical expressions in the abstract. 


Style 


Avoid footnotes, abbreviations, and acronyms. 

Mathematical symbols will be italicized unless specified otherwise except for functional symbols such as “exp(·)”
and “log(·)”, etc.

Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important 
equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are 
to be referred to later. 

Write fractions in the text using a solidus. 

Distinguish between ambiguous characters, (e.g., w, ω; o, O, 0; l, 1).

Italics are used for emphasis. 


Figures and Tables 
All figures and tables should be numbered consecutively with arabic numerals, with titles that are as self explanatory 
as possible, at the bottom for figures and at the top for tables. 


References 


References in the text should be cited with authors’ names and the date of publication. If part of a reference is cited, 
indicate after the reference, e.g., Cochran (1977, page 164). 

The list of references at the end of the manuscript should be arranged alphabetically and for the same author 
chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of 
publication. Journal titles should not be abbreviated. Follow the same format used in recent issues. 


Short Notes 


Documents submitted for the short notes section must have a maximum of 3,000 words. 
Waksberg Invited Paper Series 


The journal Survey Methodology has established an annual invited paper series in honour of Joseph 
Waksberg, who has made many important contributions to survey methodology. Each year a prominent 
survey researcher is chosen to author an article as part of the Waksberg Invited Paper Series. The paper 
reviews the development and current state of a significant topic within the field of survey methodology, and 
reflects the mixture of theory and practice that characterized Waksberg’s work. 

Please see the announcements at the end of the Journal for information about the nomination and 
selection process of the 2012 Waksberg Award. 

This issue of Survey Methodology opens with the tenth paper of the Waksberg Invited Paper Series. The 
editorial board would like to thank the members of the selection committee, Leyla Mohadjer (Chair), 
Daniel Kasprzyk, Elisabeth A. Martin and Wayne Fuller, for having selected Ivan P. Fellegi as the author of 
this year’s Waksberg Award paper. 


2010 Waksberg Invited Paper 
Author: Ivan P. Fellegi 


Ivan P. Fellegi is Chief Statistician of Canada Emeritus at Statistics Canada. He was the Chief 
Statistician of Canada from 1985 to 2008, and it was during that period that Statistics Canada was 
ranked by The Economist as the best statistical office in the world. Dr. Fellegi contributed significantly 
both to survey methodology and to the effective management of a large organization during his long 
career at Statistics Canada. 


He has published extensively on statistical methods, on the social and economic applications of 
statistics and on the successful management of statistical agencies. Some of his methodology papers 
have become landmarks: topics covered include sample design, edit and imputation, record linkage, and 
the analysis of survey data. He has actively participated on several committees: he was Chair of the 
Conference of European Statisticians of the United Nations Economic Commission for Europe (1993-97); 
Chair of the Committee on Statistics of the Organisation for Economic Cooperation and 
Development (2004-2008); past President of the International Statistical Institute, the International 
Association of Survey Statisticians, and the Statistical Society of Canada; and past chair of the Board of 
Governors, Carleton University (1995-97). He has a long list of honours that include: Officer of the 
Order of Canada; recipient of the Outstanding Achievement Award of the Public Service of Canada; the 
Order of Merit of the Hungarian Republic; the Career Achievement Award of the Canadian Policy 
Research Initiative; La Médaille de la ville de Paris; Member of the Hungarian Academy of Sciences; 
the Gold Medal of the Statistical Society of Canada; and the Robert Schuman medal of the European 
Community. He is the recipient of Honorary Doctorates from Université de Montréal, Université du 
Québec (Institut national de la recherche scientifique), Simon Fraser University, McMaster University, 
Carleton University, and the University of Ottawa. He is an Honorary Member of the International 
Statistical Institute and an Honorary Fellow of the Royal Statistical Society. 
Abstract 


The paper explores and assesses the approaches used by statistical offices to ensure effective methodological input into their 
statistical practice. The tension between independence and relevance is a common theme: generally, methodologists have to 
work closely with the rest of the statistical organisation for their work to be relevant; but they also need to have a degree of 
independence to question the use of existing methods and to lead the introduction of new ones where needed. And, of 
course, there is a need for an effective research program which, on the one hand, has a degree of independence needed by 
any research program, but which, on the other hand, is sufficiently connected so that its work is both motivated by and feeds 
back into the daily work of the statistical office. The paper explores alternative modalities of organisation; leadership; 
planning and funding; the role of project teams; career development; external advisory committees; interaction with the 
academic community; and research. 


Key Words: Methodology; Official statistics; Statistical organisation; Research; Relevance; Independence. 


1. Introduction 


It is a great honour to accept an award named after Joe 
Waksberg. Joe has been a close personal friend, as well as a 
good friend of Statistics Canada. 

I came to know Joe during his latter years in the Bureau 
of the Census when Morris Hansen asked me to become a 
member of what was then a most imposing methodology 
advisory committee of the Bureau chaired by Bill Cochran. 
Subsequently, in the late 1970s, when Statistics Canada had 
serious problems of image and of internal management, 
Statistics Canada asked a group of prominent statisticians to 
review what was wrong. At my recommendation, Joe was 
one of the three wise men asked to take part (the others 
being Richard Ruggles and the chairman, Claus Moser). Joe 
immediately agreed and in his inimitable low-key manner 
made invaluable contributions to Statistics Canada; the very 
helpful basic message being that while we had serious 
management problems, there was nothing much wrong with 
our methodology. 

A few years ago the Census Bureau honoured me by 
asking me to give one of their annual “wise elders” lectures. 
While I objected strongly on the grounds that I neither 
considered myself “wise”, nor “elder”, in the end I accepted 
their kind invitation. With typical grace, Joe took the time to 
show up for my talk, even though he was well into the 
middle of his eighties but still very busy as chairman of the 
board of WESTAT. We had a really good chat; and that was 
the last time I saw him. What a career; what a life! 

So it is not only a professional honour to accept the 
Waksberg Award, but also a personal pleasure to be 
associated with Joe one more time. 




I was told that generally the recipients of the Waksberg 
Award give an overview of an area of methodology. But 
while, as you know, I did spend the first half of my career as 
a methodologist, I stopped being a practitioner some 
decades ago — although I am still an ardent advocate (see 
Fellegi 2004). So I thought I would join the first half of my 
career — methodology — to the second half — management of 
statistical offices. I shall, therefore, talk about the lessons I 
learnt about the organisation of applied methodological 
work and methodology research in national statistical 
offices; what works well and what less so (I assume that the 
basic conditions for an effective methodology function 
exist: there is a supply of trained statisticians in the country, 
the statistical office has a functioning infrastructure, salaries, 
if they are not competitive, are at least within sight of what 
is offered in the private sector, and so on). 

I have two overall themes. Managing the tension 
between independence and relevance is one of them: 
generally, methodologists must work closely with the rest of 
the statistical organisation for their work to be relevant. 
Indeed, they must strive to serve the objectives of external 
clients, represented inside the office by subject matter 
experts. However, for them to be effective they must enjoy 
the necessary independence to question the use of existing 
methods, and to champion new ones if they believe they 
could reduce costs or increase statistical quality. 

But the effectiveness of methodology also depends on a 
strong methodology research capacity which, on the one 
hand, has the necessary independence needed by any 
research program, but which, on the other hand, is 
sufficiently connected to on-going work so that it is both 
motivated by and feeds back into the daily practice of the 
statistical office. The organisation of methodology research 
will be my second theme. 

But first I want to define what I mean in the present 
context by the terms methodology, relevance and 
independence. 

2. Some definitions 


Methodology 


The unique service performed by methodology is to 
maximise statistical quality given an imposed budget (or 
conversely). Methodologists do so through the application of statistical 
practice that is either based on statistical theory or on 
organized empirical observation. In other words method- 
ologists are wizards of the relevant statistical theories; but 
also of “organised empirical observation” where formal 
theory abandons us. By organised empirical evidence I 
mean designed experiments or analytically assessed 
experience. So I am including all organized knowledge 
about the use of methods and approaches that result in the 
objective of maximising quality within a budget — or 
conversely, minimising the budget needed to achieve a 
stated quality level. 

This would include such things as sample design, 
estimation, data editing, imputation, exploitation of 
administrative data, record linkage, seasonal adjustment, 
questionnaire design, measurement of accuracy and quality 
assurance of censuses and surveys, the use of experimental 
designs, and so on. 

Methodologists are predominantly mathematical statis- 
ticians and they work on the applied end of their subject. 
Because of the interdisciplinary nature of official statistics 
they interact with survey managers, experts in data collec- 
tion, IT personnel, geographers, sociologists, economists, etc. 


Relevance 


Methodology is relevant if the day to day practice of the 
statistical office is actually based on sound methodology. A 
major issue in the organization of methodology is how to 
balance the intrinsically service nature of methodology 
against the need for the function to provide strong and 
effective guidance. Much of the paper will deal with all 
those arrangements needed to ensure the objective of 
relevance. 

In the case of methodological research, relevance means 
that the research is both motivated by and informs applied 
work. 


Independence 


The notion of independence of methodology means the 
ability to provide sound methodological guidance to 
projects, irrespective of the hierarchical arrangement of line 
organisations that can be debated but not ignored; and that 
this debate is based on evidence, not authority. So my 
definition of independence is not that methodologists should 
be able to “do their own thing” but rather that they should 
have an authoritative voice. 

Independence is frequently contrasted with relevance. 
Since relevance is about embedding methodology into 
practice, this is often attempted by building methodological 
services right into the fabric of subject matter organisations. 
By contrast, independence is thought to be enhanced by 
giving methodologists their own organisation(s). In this 
sense, therefore, there is a tension between the two. How- 
ever, I would argue that relevance cannot be achieved if 
methodological guidance is ignored, so appropriate arrange- 
ments to ensure independence are necessary for relevance. 

Independence of methodological research is different: it 
is generally meant to refer to an environment in which 
researchers have predominant say in the choice of their 
topics. Clearly, providing researchers with such an environ- 
ment does create a permanent tension with the need to be 
relevant at all times, particularly when it is not at all obvious 
in the short term where the relevance lies. 

In my discussion of how to balance relevance and 
independence of both the applied methodology function and 
of methodology research I will describe not only orga- 
nisational arrangements, but a wide variety of tools and 
arrangements that should be considered in the pursuit of this 
objective. I shall use Statistics Canada as a concrete 
illustration. What I wish to emphasize is that the issue is 
much more complicated than what the terms “centralisation” 
and “decentralisation” denote: whichever of these basic 
organisational arrangements is adopted, many additional 
tools are needed to offset its disadvantages while 
maintaining its intrinsic advantages. Indeed, I have 
organised the rest of the paper around a discussion of the 
main tools (in choosing these tools for discussion, I 
borrowed from the paper by Brackstone 1997) involved 
under the following headings: 


- Organisation; 
- Leadership; 
- Planning and funding; 
- Project teams; 
- Career development; 
- Advisory Committees; 
- Interaction with the academic community; and 
- Research. 


3. Organisation 


General thoughts 


National statistical offices differ in the way they organise 
their methodology functions. In some it is distributed to 
individual parts of the agency, each responsible for a given 
subject (e.g., labour). In other agencies decentralisation is 
only partial, e.g., to broader subject matter areas (such as 
demography or business statistics). The US Bureau of the 
Census, for example, has largely decentralised its 
methodology function. By contrast, Statistics Canada and 
the Australian Bureau of Statistics have largely centralised 
it. Many factors influence the organizational choice. For 
example, in France and in India, where all professionals 
share a similar background in statistics and are largely 
recruited from a single teaching institution, the accent is 
obviously on centralizing training and to a lesser extent 
research. 

The traditional arguments are that decentralisation 
favours relevance and centralisation favours independence. 
However, the aim should be to have both. That being the 
case the question is how we can enhance independence in 
the case of decentralised methodology organisations, and 
relevance in the case of centralised ones. 

Decentralisation, while potentially serving to underscore 
relevance, has some built-in disadvantages. Since each unit 
to which methodology is decentralized is necessarily smaller 
than it would be in more centralized options, it is less likely 
to facilitate specialisation and research. It is also less likely 
to encourage cross-fertilisation by methodologists working 
on other issues. Also, since the line organisations to which 
methodology is decentralised are typically not headed by 
methodologists, this model tends to result in lower 
hierarchical positions for the heads of the decentralised 
methodology units. In case of “conflicts” (and these will 
be inevitable because of different perceptions of priority, 
cost, quality and so on), other things being equal, it will be 
more difficult for methodologists to defend their 
professional advice. If left without a counterweight, this 
kind of organization could get out of balance. 

A critical counterweight could be a “chief meth- 
odologist” who reports directly to the head of the statistical 
office and inevitably is called upon to play an important role 
in long term planning and resource allocation. The “chief 
methodologist” could have his hand strengthened if given 
direct line responsibility for a strong research and 
development function which could serve as the “intellectual 
home base” for the decentralised methodology staff. 

Project teams, brought together for large developments, 
are another important tool to enhance independence in the 
case of decentralised organisations. Such projects, which if at 
all significant are necessarily multi-disciplinary, are carried 
out by ad hoc project teams which operate off-line from the 
agency’s line organization. The organization of project 
teams is a matter to which Statistics Canada devoted 
considerable attention and it has been refined over time. 
Among its elements there is the feature that whenever 
professional disputes within the teams arise and the team 
believes that their solution requires outside intervention, the 
dispute is referred to a senior group of which someone from 
the staff of the “chief methodologist” is a member (this is 
automatically the case if the methodologist comes from a 
centralised group). It is this senior steering group that can 
contribute to protecting independence. 

Consideration might also be given to providing some 
additional tools for the “chief methodologist”: he could be 
authorised and funded to develop a strong methodology 
training program; he could be given a strong role in the 
allocation and career development of the methodology staff; 
he could be supported by a strong external advisory 
committee; and so on. These features recognize that the 
role of “chief methodologist” is particularly delicate and 
could become more so if his place in the hierarchy were 
dependent on the size of the staff he controls directly 
without provision — as there is in some countries — to have 
his level of access and place in the ladder depend on his 
personal prestige rather than on the size or level of 
supporting staff. 


Centralisation: the Statistics Canada model 


Many years ago Statistics Canada opted for the 
centralised model (see Fellegi 1996), and that option was 
never seriously challenged (it was challenged for a brief 
period of time in the late seventies, but in concrete terms the 
challenge did not get anywhere). The agency also put in place 
a number of practices designed to reduce the threat that 
centralisation might result in diminished relevance. 


1. Project teams: These are inter-disciplinary and 
include as a matter of course a methodologist, but 
they are headed by a project manager who comes 
from the subject matter area and who 
is likely to assume operational responsibility for the 
completed project. 

2. Funding: much of the funding for the methodology 
function is controlled by the rest of Statistics Canada. 
Program areas (within limits that I will describe 
further) are free to spend their money on buying 
methodology services or not so long as they do not 
fall foul of the agency’s quality norms and accepted 
standards. With their budget largely on the line year 
after year, this accountability means that it is very 
much in the interest of methodologists to be 
responsive to the needs of the Agency’s Programs. 

3. Organisation of the methodology function: it largely 
parallels the organisation of Statistics Canada. There 
are four methodology divisions: three of them 
provide methodology input to three different areas of 
the agency, while the fourth is devoted to research. In 
fact, the three applied methodology divisions are 
themselves organised by subject matter in parallel 
with the manner in which the bureau is organised 
(regular rotation of methodology staff ensures broad 
development opportunities for methodologists). 

4. Co-location of methodology staff: methodologists 
are occasionally physically moved to the offices of 
the subject matter areas whose surveys they help to 
design. This is an additional measure taken to ensure 
that they focus on the right issues. 

5. Finally, as a matter of sound practice, methodologists 
conduct — and follow up on the results of — client 
satisfaction surveys which provide feedback on all 
aspects of their performance and first and foremost 
on the relevance thereof. 


4. Leadership 


General thoughts 


Leadership is crucial. The leader of the methodology 
function, in addition to a proper academic background and a 
great deal of experience in methodology, must possess a 
strategic vision and a personality that inspires confidence. 
This is an intrinsically difficult function. In the over- 
whelming majority of offices operational and subject matter 
considerations are the ones that receive the most attention. 
In such an environment an authoritative voice for meth- 
odology is needed to ensure adequate resources for the 
methodology function itself, but even more importantly to 
lead the entire agency in directions that are technically 
sound, and conversely to hold back initiatives that cannot be 
supported by sound methodology. “Soundly based” 
involves more than good survey design that uses the best 
available current knowledge. It also includes the notion of 
strategic planning of research, experiments and pilot surveys 
so as to improve the likelihood that whatever knowledge 
will be needed in the future will be available. For the 
opinions of methodologists to make a proper impact they 
must be supported by a leader whose unchallenged 
personal competence is combined with a seat at the 
statistical agency’s most senior table. 

If methodologists do not belong to a central organization 
within the statistical agency it is all the more important for 
their senior representative to be highly placed in the 
hierarchy since under a decentralized scheme he would not 
have direct line authority for (the bulk of) methodology 
resources. 


Centralisation: the Statistics Canada model 


Centralisation provides another lever to enable the leader 
of the methodology function to carry out his proper role as it 
enables him to make rational and authoritative assignments 
of the resources under his direction to the most strategic 
projects. The top advocate of sound methodology in 
Statistics Canada has the status of Assistant Chief 
Statistician (ACS) — the rank immediately below that of the 
Chief Statistician of Canada. In order to secure such a high 
position in a government bureaucracy, the line responsibility 
of the ACS (Methodology) includes statistical standards 
(classifications and central registers), as well as informatics 
(IT). While the position is therefore responsible for more 
than methodology, it is by long tradition (over 35 years) 
filled by someone who is a noted expert on methodology 
and can therefore speak at the top table authoritatively about 
its importance in general as well as in the context of 
particular projects. 


5. Planning and funding 


General thoughts 


The effective functioning of methodology (as indeed the 
entire statistical office) greatly depends on the existence of a 
proper planning system (see Fellegi 1992 and Brackstone 
199K): 

- Planning is a necessary condition to ensure that 
resources are allocated rationally at all times. 

- It also serves to mark explicitly the beginning and the 
end of development projects and therefore constitutes 
the ideal opportunity for methodology to “sign off” on 
the proposed design of new projects. 

- Lastly, the planning system creates an opportunity for 
methodology to make an explicit judgement on whether 
a planned new venture can respect simultaneously its 
budgetary constraints, the agency’s quality standards, 
and the expected maintenance bill. In fact, the planning 
system also provides an opportunity for all representatives 
of the disciplines involved in the creation of a 
new project (its planning or its implementation) to “sign 
off” as a mark of assuming professional responsibility 
for the adequacy of its funding or for the integrity of its 
functioning. 


Such a planning system is essential where the main 
disciplines (methodology, systems development, data 
collection, etc.) are centralised, for otherwise the 
organisations responsible cannot make provisions for the needed 
resources. But, for more subtle reasons, decentralised offices 
need it just as much: to provide an explicit forum for the 
leaders of methodology (and, indeed, other key disciplines) 
to make their input during the critical formative stages of 
new projects. 
Centralisation: the Statistics Canada model 


Every new project or major redesign is approved within 
Statistics Canada’s planning system. In preparation for its 
consideration, a comprehensive budget is developed and all 
major disciplines which are required to contribute sign off 
on the appropriateness of the proposed design and 
operational modalities. If the project is approved, its budget 
is divided up and distributed to participating disciplines, 
including methodology. In turn, these organisations 
“contract” to deliver the agreed contributions within the 
approved budgets. A project manager oversees both 
progress and expenditures, with authority to reassign 
resources, if necessary. 


The budget of the Methodology organization is 
composed of five distinct sources. These are designed, on 
the one hand, to facilitate the sound planning of the use of 
methodology and its thorough integration into the work of 
the Agency, and on the other to secure for it the needed 
funding. 


1. The contribution of methodology to developmental 
projects is guaranteed by the planning process of 
Statistics Canada, as indicated above. The financial 
contribution to the methodology budget from these 
sources may vary from year to year, but there is a 
reasonable overall stability (facilitating the hiring 
and development of permanent staff). They ac- 
count for almost 30 percent of the total meth- 
odology budget. These projects typically involve 
major redesigns, often with significant experi- 
mentation and innovation. 

2. But methodological contributions are also needed for 
maintenance (quality control, monitoring of various 
errors including variance estimation where relevant, 
minor design adjustments, etc.). For these activities 
there are core resources set aside and more or less 
permanently allocated by broad subject matter. This 
constitutes the second component of the methodology 
budget and it accounts for somewhat less than 25%. 
While for methodology this “on-going” work 
accounts for less than 25% of their workload, for 
Statistics Canada as a whole “on-going” work 
accounts for over 90% of our budget. This is because 
of the innovative nature of methodology work. 

3. A third component comes from supplementary 
resources funded directly by the beneficiary subject 
matter divisions who, in effect, make savings from 
their other expenditures to avail themselves of 
additional methodology contributions. These 
supplementary funds account for a by no means 
negligible 20% or so of the methodology budget. The 
very fact that subject matter divisions consider 
methodology sufficiently valuable to fund methodological 
advice directly says a lot about the health of 
the relationship and of the extent to which it is 
valued. The funds in question are for a mixture of 
projects including enhancements short of a major 
redesign of on-going projects. They also strengthen 
the awareness of methodology staff of the need to 
remain relevant for their users. The kind of service 
they provide has a direct bearing on the amount of 
resources that are made available to them. 

4. The fourth part of the methodology budget (about 
20 per cent) comes from externally funded projects, 
typically from the budget of surveys funded by other 
departments. No more needs to be said about them. 

5. The final part (7 per cent) is for research. This is a 
“block fund”, meaning that a certain fixed amount of 
funds is allocated for the research function. The 
annual allocation is governed by a mechanism 
described below. 


The intricacies of the funding mechanism and the 
multiplicity of funding sources are a reflection of the care 
exercised in the agency to balance the virtues of 
independence with those of relevance. 
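
As a rough arithmetic check on the five funding sources listed above, the quoted shares can be tallied. This is only an illustrative sketch: the figures are the rounded values given in the text (two are stated as upper bounds and two as approximations, so the total is not expected to be exactly 100), and the names `approx_shares` and `summarise` are hypothetical, introduced here purely for the example.

```python
# Approximate shares of the methodology budget, in percent, as quoted in the text.
# "almost 30" and "somewhat less than 25" are treated as upper bounds;
# "20% or so" and "about 20 per cent" are approximations.
approx_shares = {
    "developmental projects": 30,
    "maintenance (core resources)": 25,
    "supplementary funding from subject matter divisions": 20,
    "externally funded projects": 20,
    "research (block fund)": 7,
}


def summarise(shares):
    """Return the total of all shares and the part paid for directly by clients.

    "Clients" here means sources 3 and 4: subject matter divisions funding
    extra methodology work, and surveys funded by other departments.
    """
    total = sum(shares.values())
    client_funded = (
        shares["supplementary funding from subject matter divisions"]
        + shares["externally funded projects"]
    )
    return total, client_funded


total, client_funded = summarise(approx_shares)
print(f"Sum of quoted shares: about {total}%")
print(f"Funded directly by clients (sources 3 and 4): about {client_funded}%")
```

The tally comes to slightly more than 100 because two of the quoted figures are upper bounds. On these rounded figures, roughly two fifths of the methodology budget depends directly on clients choosing to pay for methodology services.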


6. Project teams 


General thoughts 


The use of project teams in developmental projects helps 
to strengthen relevance without it being necessarily at the 
expense of independence. But project teams are not a 
universal panacea as everything depends on establishing 
appropriate checks and balances. In centralised orga- 
nisations project teams, most often headed by a project 
manager from the sponsoring subject matter area, help to 
nudge the participating methodology staff to pay proper 
attention to the objectives and constraints of projects. 
Nonetheless there remains an inherent danger that the 
project manager will not give sufficient weight to the 
considered advice of methodologists. 

Project teams in decentralised organisations are just as 
important to ensure that the views of methodologists are 
given appropriate weight. Here, however, the dice are 
clearly weighted in favour of relevance and against 
independence. Moreover, an exaggerated emphasis on 
“relevance” has its danger as well since it can lead to local 
optimisation. Local optimisation is a situation where 
surveys are optimised without regard to agency wide 
objectives. An example might be a situation where surveys 
are customised to an extent such that the introduction of 
important efficiencies through the use of agency-wide 
standards and general systems becomes difficult (the 
widespread use of generalized approaches, systems and 
tools can be a source of considerable agency-wide 
efficiencies: they shorten implementation times, reduce the 
expenditure on both systems development and maintenance, 
facilitate staff rotation, etc. However, generalized systems 
might lack some features which could enhance the 
efficiency of any given operation. Decentralized 
organizations are more likely to favour such locally developed 
solutions in preference to agency-wide standard tools, even 
though the latter might lead to substantial long-run 
efficiencies). 


Centralisation: the case of Statistics Canada 


In Statistics Canada project teams working on major 
development projects are accountable and report to steering 
committees typically composed of the heads of the 
participating disciplines. A steering committee approves the 
broad project strategy, and serves, if needed, as a forum to 
which issues can be referred that could not be resolved 
within the team itself. In practice such appeals are rare and 
are restricted to cases where professional principles or truly 
strategic issues are involved. Steering committees ensure 
that issues do not get resolved within the project team on the 
basis of rank but rather on the basis of professional merit. 

Methodologists serving on project teams carry out a dual 
function: 

- At a strategic level, they help ensure that the overall 
survey design achieves the project’s substantive 
objectives, while striking a balance between reliability, 
cost, timeliness and respondent burden. While striving 
for this balance concerns the entire project team, it is the 
methodologists who provide the framework and 
techniques that must be considered in seeking the 
optimum balance. 

- At a tactical level, the methodologists provide the 
statistical methods and tools that are incorporated into 
the overall survey design: the sample design, the 
estimation and weighting approach, quality control, 
editing and imputation strategies, coverage checks, 
analytic methods and the like. 


Project teams function best in an organisation dedicated 
to making decisions on the basis of merit; where everyone 
can pose questions and expect reasoned answers; one that is 
devoted to making maximum use of the expertise of 
everyone involved. 


7. Career development 


General considerations 


Career development is essential for all professional 
groups, and it involves both formal training as well as 
formal and informal approaches to facilitate on-the-job 
learning. Methodology staff, in my view, requires special 
attention in this respect. The reason is that universities in 
general offer few, if any, courses in survey methodology 
(there is an increasing number of exceptions, although their 
numbers are still far from overwhelming. A most notable 
one is the Joint Program in Survey Methodology, University 
of Maryland. But there are also degree programs on official 
statistics in the UK, Ireland and New Zealand which include 
survey methodology). Since a thorough professional 
knowledge is essential for both relevance and independence, 
most statistical offices wanting to maintain a_ strong 
methodology staff have no alternative to having a carefully 
designed career development program — whether meth- 
odology is organised in a centralised or decentralised 
manner. 

For the courses to be relevant, it is desirable that a 
substantial portion of courses should be taught by staff 
members who are themselves active practitioners. This is
more easily arranged in centralised organisations where the senior
methodologists can not only deploy staff to do teaching 
(typically on a part time basis), but can also arrange suitable 
replacements for them in their current project work. 

The broader aspects of career development are also more easily
arranged in centralised organisations: they can more readily
manage the periodic assignment of staff to different types of 
survey work, attendance at scientific conferences, the 
provision of research opportunities to those interested in and 
capable of doing part-time research work, and most 
importantly the service of apprenticeships under more 
experienced methodologists. 


The case of Statistics Canada 


Training, not only in methodology, is emphasized by 
Statistics Canada (see Statistics Canada 1995). Overall, it
spends about 3% of its budget (or $15 million) on formal
training, plus a great deal more on various means of career
development. But, in line with the centrality of training in
methodology, the percentage of the methodology budget spent
on it is almost twice as much (bordering on 6 per cent in the
2008-09 fiscal year).

Training is provided in formal courses within Statistics 
Canada’s Training Institute which currently (in 2009) offers 
some 20 courses in methodology, ranging in level from 
introductory courses to graduate level material. Most
courses are taught by in-house staff; occasionally university
personnel, mostly from local universities, are engaged if
they are interested in teaching and/or helping to develop our
staff in other ways (e.g., consultation). In the latter modality
we have been particularly fortunate in having had the
contributions of Professor J.N.K. Rao of Carleton
University over a period of some decades.

All recruits have to take a basic six-week course which
teaches (and provides practice in) survey design, survey
operations, processing and analysis. This introductory
training serves a multiplicity of purposes. Since the same
basic six-week course in survey work is taken by all new
professionals, it helps early on to inculcate in everyone a
basic knowledge of all that is involved in survey work; and,
even more importantly, to drive home the critical
importance of inter-disciplinary team work. It is also at this
stage that new recruits from other disciplines are exposed
for the first time to the requirements of methodology in
survey design.

Career development involves much more than training. 
Staff members, particularly at the earlier stages of their
careers, are regularly given opportunities to take on different
types of work: demographic, socio-economic, business surveys, use
of administrative records, record linkage, etc. Significant
numbers also attend scientific conferences. For example, 
during the last several years some 17 percent of the 
methodology staff attended various Canadian and 
international professional conferences per annum. Staff is 
also encouraged to work on research projects and publish 
findings in peer reviewed journals, including Statistics 
Canada’s Survey Methodology. Finally, for many years now 
Statistics Canada has organised an international meth- 
odology symposium to which leading research personnel 
from around the world are invited. These symposia are, of 
course, open to all Statistics Canada personnel and most 
methodologists choose to attend them. 


8. Advisory Committee 


General considerations 


A Methodology Advisory Committee can serve a most
useful function in (a) ensuring sound methodology practices,
(b) integrating these practices into the daily work of 
statistical organisations, and (c) training staff. But the 
Committee can only be effective if (a) its advice is sought 
on significant issues of methodology and (b) there are 
mechanisms to ensure that the Committee’s views are given 
due weight. I have observed Methodology Advisory 
Committees playing an equally useful role in a centralised 
office (Statistics Canada) and in a decentralised one (the 
Bureau of the Census in the 1960s). 


The case of Statistics Canada 


Statistics Canada’s Methodology Advisory Committee 
plays a key role. There are several factors that contribute to 
its usefulness and standing: 

The personal standing of the Committee’s members is
part of the reason.

Every significant project of Statistics Canada is referred
to the Committee for advice.

The Committee’s review is facilitated by the 
preparation of a paper for each item of the agenda 
which is introduced by a brief oral presentation by staff. 
Designated members of the Committee serve as formal 
discussants of each item on the agenda. The discussants 
present their views formally. Given that most of the 
papers are prepared by mid-career staff, these 
discussions make not only a substantive contribution to 
the projects that are discussed, but also to the training of 
the staff concerned — and that of the audience. 

Meetings of the Committee are attended not only by a 
large number of the relevant methodologists, but also 
by senior personnel of the subject matter division 
concerned, including often the Chief Statistician as well 
as one or two of his assistants. 

The Committee meets regularly: twice a year, for a day 
and a half on each occasion. 

The Committee regularly reviews the follow-up arising 
from its conclusions and formal recommendations; this 
helps ensure that their advice is taken seriously. 


9. Research


General considerations 


I am taking it for granted that for this audience I do not 
need to spend time underscoring the intrinsic importance of 
research in a statistical agency. But let me stress the 
following points: 

Careful thought should be given to organising the 
research function in a manner that maximises both its 
relevance and the likelihood that its benefits will be 
successfully transmitted into daily practice. It is crucial 
to avoid the twin dangers of research being self-serving, 
or alternatively so completely task-oriented that it 
becomes pedestrian. 

Research needs to be adequately funded. 

In-house research needs to develop and to maintain 
close links with relevant extramural research. 


The case of Statistics Canada 


One of the four methodology divisions is formally 
devoted to full time research. But the research is organised 
in a particular manner. Even though the research budget 
provides for the equivalent of 22 full time research staff, the 
research division itself has only six full time members. The 
remaining budget is assigned to finance the part-time
research work of some 70 other methodologists. This
arrangement serves a variety of purposes. First, it
contributes to the relevance of research. Secondly, it
contributes to the adoption of the results of research. And
thirdly, it helps morale, for while not everyone wants to do
research (or is able to do so), many want to try their hand at
it. And the very act of conducting some research, by those
capable of it, leads to more open mindsets and a better
informed practice.

We are trying to ensure that the particular projects 
approved for research are in line with the broad research 
priorities of Statistics Canada, but at the same time leave 
some scope for self-initiated research. We do this by 
establishing broad priorities each year and inviting
proposals in those areas from staff. The proposals are 
subject to formal adjudication: the best ones are selected and 
staff are assigned to work on them. Senior advice and 
guidance is provided by the Director of the Statistical 
Research and Innovation Division and its small permanent 
staff. 

The following are additional measures that help the 
quality of research carried out: 

The possibility of publishing papers in Survey
Methodology, Statistics Canada’s own publication,
serves as an incentive. While the peer review of the
articles is rigorously managed by an international
editorial board, the existence of a local yet prestigious
outlet for methodology research represents a visible
commitment by senior management.

We regularly co-author papers with well known 
external research personnel (both Canadian and non- 
Canadian). 

We hold regular methodology interchanges with 
methodology staff in the US Bureaus of the Census and 
of Labour Statistics. 

We participate actively in Canadian, American and 
international statistical organisations. 


10. Concluding comments 


As indicated in the introduction, the bulk of the paper 
was devoted to the tools that should be considered by 
statistical offices in establishing and supporting the 
methodology function and the associated research, tools that 
in appropriate combination can enhance both the 
professional independence as well as the relevance of the 
function. I want to emphasise, however, that this is not a 
cook book. More important than all the tools is the 
environment: whether the statistical office welcomes
questioning and ensures that substantive questions are
answered in substance; whether change is intrinsically
frowned upon; whether it fosters collegiality; whether
intelligent risk taking is encouraged or frowned upon;
whether experiments are welcomed, assessed on their
merits, and acted upon. These are the attributes that come
from the top leadership of the statistical office and tools 
cannot substitute for them. Under the wrong leadership the 
best methodology staff (or, indeed, the best statistical office 
itself) will wither. But the contrary is not true: it is essential 
to have a careful understanding of the subtle balances 
advocated in this paper, as well as a careful deployment of 
the tools that give them effect. And even then, only a long 
term strategy can succeed. 

I am completely certain that Joe would agree with my 
conclusion (see Waksberg 1998).


Design for estimation:
Identifying auxiliary vectors to reduce nonresponse bias


Carl-Erik Särndal and Sixten Lundström ¹


Abstract 


This article develops computational tools, called indicators, for judging the effectiveness of the auxiliary information used to 
control nonresponse bias in survey estimates, obtained in this article by calibration. This work is motivated by the survey 
environment in a number of countries, notably in northern Europe, where many potential auxiliary variables are derived 
from reliable administrative registers for households and individuals. Many auxiliary vectors can be composed. There is a
need to compare these vectors to assess their potential for reducing bias. The indicators in this article are designed to meet 
that need. They are used in surveys at Statistics Sweden. General survey conditions are considered: There is probability 
sampling from the finite population, by an arbitrary sampling design; nonresponse occurs. The probability of inclusion in 
the sample is known for each population unit; the probability of response is unknown, causing bias. The study variable (the 
y-variable) is observed for the set of respondents only. No matter what auxiliary vector is used in a calibration estimator (or 
in any other estimation method), a residual bias will always remain. The choice of a “best possible” auxiliary vector is 
guided by the indicators proposed in the article. Their background and computational features are described in the early 
sections of the article. Their theoretical background is explained. The concluding sections are devoted to empirical studies. 
One of these illustrates the selection of auxiliary variables in a survey at Statistics Sweden. A second empirical illustration is 
a simulation with a constructed finite population; a number of potential auxiliary vectors are ranked in order of preference 
with the aid of the indicators. 


Key Words: Calibration weighting; Nonresponse adjustment; Nonresponse bias; Auxiliary variables; Bias indicator. 


1. Introduction 


Large nonresponse is typical of many surveys today. This 
creates a need for techniques for reducing as much as 
possible the nonresponse bias in the estimates. Powerful 
auxiliary information is needed. Administrative data files 
are a source of such information. The Scandinavian coun- 
tries and some other European countries, notably the 
Netherlands, are in an advantageous position. Many poten- 
tial auxiliary variables (called x-variables) can be taken from 
high quality administrative registers where auxiliary vari- 
able values are specified for the entire population. Variables 
measuring aspects of the data collection are another useful
type of auxiliary data. Effective action can be taken to 
control nonresponse bias. Beyond sampling design, design 
for estimation becomes, in these countries, an important 
component of the total design. Statistics Sweden has 
devoted considerable resources to the development of
techniques for selecting the best auxiliary variables. 

Many articles discuss weighting in surveys with non- 
response and the selection of “best auxiliary variables”. 
Examples include Eltinge and Yansaneh (1997), Kalton and 
Flores-Cervantes (2003), and Thomsen, Kleven, Wang and 
Zhang (2006). Weighting in panel surveys with attrition 
receives special attention in, for example, Rizzo, Kalton and 
Brick (1996), who suggest that “the choice of auxiliary
variables is an important one, and probably more important
than the choice of the weighting methodology”. The review
by Kalton and Flores-Cervantes (2003) provides many 
references to earlier work. As in this paper, a calibration 
approach to nonresponse weighting is favoured in Deville 
(2002) and Kott (2006). 

Some earlier methods are special cases of the outlook in 
this article, which is based on a systematic use of auxiliary 
information by calibration at two levels. Recently the search 
for efficient weighting has emphasized two directions: (i) to 
provide a more general setting than the popular but limited 
cell weighting techniques, and (ii) to quantify the search for 
auxiliary variables with the aid of computable indicators. 
Särndal and Lundström (2005, 2008) propose such indicators,
while Schouten (2007) uses a different perspective to
motivate an indicator. An article of related interest is 
Schouten, Cobben and Bethlehem (2009). 

The content of this article has four parts. The general
background for estimation with nonresponse is stated in
Sections 2 to 4. Indicators for preference ranking of
x-vectors are presented in Sections 5 and 6, and the
computational aspects are discussed. The linear algebra
derivations behind the indicators are presented in Sections 7
and 8. The two concluding Sections 9 and 10 present two
empirical illustrations. The first (Section 9) uses real data
from a large survey at Statistics Sweden. The second
(Section 10) reports a simulation carried out on a
constructed finite population.


2. Calibration estimators for a survey
with nonresponse


A probability sample $s$ is drawn from the population
$U = \{1, 2, \ldots, k, \ldots, N\}$. The sampling design gives unit $k$
the known inclusion probability $\pi_k = \Pr(k \in s) > 0$ and the
known design weight $d_k = 1/\pi_k$. Nonresponse occurs. The
response set $r$ is a subset of $s$; how it was generated is
unknown. We assume $r \subseteq s \subseteq U$, and $r$ non-empty. The
(design weighted) response rate is

$$P = \frac{\sum_r d_k}{\sum_s d_k} \qquad (2.1)$$

(if $A$ is a set of units, $A \subseteq U$, a sum $\sum_{k \in A}$ will be written
as $\sum_A$). Ordinarily a survey has many study variables. A
typical one, whether continuous or categorical, is denoted $y$.
Its value for unit $k$ is $y_k$, recorded for $k \in r$, not available
for $k \in U - r$. We seek to estimate the population y-total,
$Y = \sum_U y_k$. Many parameters of interest in the finite
population are functions of several totals, but we can focus
on one such total.

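The auxiliary information is of two kinds. To these
correspond two vector types, $\mathbf{x}_k^{*}$ and $\mathbf{x}_k^{\circ}$. Population
auxiliary information is transmitted by $\mathbf{x}_k^{*}$, a vector value
known for every $k \in U$. Thus $\sum_U \mathbf{x}_k^{*}$ is a known population
total. Alternatively, we allow that $\sum_U \mathbf{x}_k^{*}$ is imported from
an exterior source and that $\mathbf{x}_k^{*}$ is a known (observed) vector
value for every $k \in s$. Sample auxiliary information is
transmitted by $\mathbf{x}_k^{\circ}$, a vector value known (observed) for
every $k \in s$; the total $\sum_U \mathbf{x}_k^{\circ}$ is unknown but is estimated
without bias by $\sum_s d_k \mathbf{x}_k^{\circ}$. The auxiliary vector value
combining the two types is denoted $\mathbf{x}_k$. This vector and the
associated information are

$$\mathbf{x}_k = \begin{pmatrix} \mathbf{x}_k^{*} \\ \mathbf{x}_k^{\circ} \end{pmatrix}; \qquad \mathbf{X} = \begin{pmatrix} \sum_U \mathbf{x}_k^{*} \\ \sum_s d_k \mathbf{x}_k^{\circ} \end{pmatrix} \qquad (2.2)$$

Tied to the $k$th unit is the vector $(y_k, \mathbf{x}_k, \pi_k)$. Here, $\pi_k$ is
known for all $k \in U$, $y_k$ for all $k \in r$, the component $\mathbf{x}_k^{*}$ of
$\mathbf{x}_k$ carries population information, the component $\mathbf{x}_k^{\circ}$ of $\mathbf{x}_k$
carries sample information.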

Many x-vectors can be formed with the aid of variables 
from administrative registers, survey process data or other 
sources. Among all the vectors at our disposal, we wish to 
identify the one most likely to reduce the nonresponse bias,
if not to zero, then at least to a near-zero value.

We consider vectors having the property that there exists
a constant non-null vector $\boldsymbol{\mu}$ such that

$$\boldsymbol{\mu}'\mathbf{x}_k = 1 \quad \text{for all } k \in U \qquad (2.3)$$

“Constant” means that $\boldsymbol{\mu} \neq \mathbf{0}$ does not depend on $k$, nor on $s$
or $r$. Condition (2.3) simplifies the mathematical derivations
and does not severely restrict $\mathbf{x}_k$. Most x-vectors useful in
practice are in fact covered. Examples include: (1)
$\mathbf{x}_k = (1, x_k)'$, where $x_k$ is the value for unit $k$ of a
continuous auxiliary variable $x$; (2) the vector representing a
categorical x-variable with $J$ mutually exclusive and
exhaustive classes, $\mathbf{x}_k = \boldsymbol{\gamma}_k = (\gamma_{1k}, \ldots, \gamma_{jk}, \ldots, \gamma_{Jk})'$, where
$\gamma_{jk} = 1$ if $k$ belongs to group $j$, and $\gamma_{jk} = 0$ if not,
$j = 1, 2, \ldots, J$; (3) the vector $\mathbf{x}_k$ used to codify two
categorical variables, the dimension of $\mathbf{x}_k$ being
$J_1 + J_2 - 1$, where $J_1$ and $J_2$ are the respective number of
classes, and the ‘minus-one’ is to avoid a singularity in the
computation of weights calibrated to the two arrays of
marginal counts; (4) the extension of (3) to more than two
categorical variables. Vectors of the type (3) and (4) are
especially important in statistics production in statistical
agencies (the choice $\mathbf{x}_k = x_k$, not covered by (2.3), leads to
the nonresponse ratio estimator, known to be a usually poor
choice for controlling nonresponse bias, compared with
$\mathbf{x}_k = (1, x_k)'$, so excluding the ratio estimator is no great
loss).
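
Condition (2.3) can be checked directly for examples (1) to (3). The following is a minimal sketch, not part of the original article: the helper names (`one_hot`, `two_way_vector`, `dot`) and the class counts `J1`, `J2` are hypothetical, and the choice to drop the last class of the second variable is one possible coding of example (3).

```python
from itertools import product

def one_hot(level, n_levels):
    """Indicator vector with a 1 in position `level` (0-based)."""
    return [1.0 if j == level else 0.0 for j in range(n_levels)]

def two_way_vector(a, J1, b, J2):
    """x_k for two categorical variables, dimension J1 + J2 - 1.

    All J1 indicators of the first variable are kept; the last class of
    the second variable is dropped to avoid a singular matrix.
    """
    return one_hot(a, J1) + one_hot(b, J2)[:-1]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

J1, J2 = 3, 4  # hypothetical numbers of classes

# Candidate constant vectors mu for examples (1), (2) and (3).
mu_continuous = [1.0, 0.0]                  # x_k = (1, x_k)'
mu_one_way = [1.0] * J1                     # x_k = gamma_k
mu_two_way = [1.0] * J1 + [0.0] * (J2 - 1)  # two-way vector

# mu' x_k must equal 1 for every possible value of x_k.
checks = {
    "continuous": all(dot(mu_continuous, [1.0, x]) == 1.0
                      for x in (0.5, 7.0, -3.2)),
    "one_way": all(dot(mu_one_way, one_hot(a, J1)) == 1.0
                   for a in range(J1)),
    "two_way": all(dot(mu_two_way, two_way_vector(a, J1, b, J2)) == 1.0
                   for a, b in product(range(J1), range(J2))),
}
```

For the ratio-estimator choice, where the vector is the scalar $x_k$ alone, no single constant multiplier can give 1 for units with different $x_k$, which is why that choice falls outside (2.3).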

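The calibration estimator of $Y = \sum_U y_k$, computed on the
data $y_k$ for $k \in r$, is

$$\hat{Y}_{\mathrm{CAL}} = \sum_r w_k y_k \qquad (2.4)$$

with $w_k = d_k\{1 + (\mathbf{X} - \sum_r d_k \mathbf{x}_k)'(\sum_r d_k \mathbf{x}_k \mathbf{x}_k')^{-1}\mathbf{x}_k\}$. The
weights $w_k$ are calibrated on both kinds of information:
$\sum_r w_k \mathbf{x}_k = \mathbf{X}$, which implies $\sum_r w_k \mathbf{x}_k^{*} = \sum_U \mathbf{x}_k^{*}$ and
$\sum_r w_k \mathbf{x}_k^{\circ} = \sum_s d_k \mathbf{x}_k^{\circ}$. We assume throughout that the
symmetric matrix $\sum_r d_k \mathbf{x}_k \mathbf{x}_k'$ is nonsingular (for
computational reasons, it is prudent to impose a stronger
requirement: the matrix should not be ill-conditioned, or
near-singular). In view of (2.3), we have $\hat{Y}_{\mathrm{CAL}} = \sum_r w_k y_k$
with weights $w_k = d_k v_k$ where $v_k = \mathbf{X}'(\sum_r d_k \mathbf{x}_k \mathbf{x}_k')^{-1}\mathbf{x}_k$.
The weights satisfy $\sum_r d_k v_k \mathbf{x}_k = \mathbf{X}$, where $\mathbf{X}$ has one or
both of the components in (2.2).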
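
The weights in (2.4) and their calibration property can be illustrated numerically. The sketch below is an assumption-laden toy example, not the authors' code: the population, the simple random sample, the response mechanism and the function name `calibration_weights` are all invented for illustration, and both components of the information input are taken to be known population totals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population and simple random sample (all values invented).
N, n = 5000, 200
x_U = rng.uniform(1.0, 10.0, size=N)
s = rng.choice(N, size=n, replace=False)
d_s = np.full(n, N / n)                       # d_k = 1/pi_k under SRS
x_s = np.column_stack([np.ones(n), x_U[s]])   # x_k = (1, x_k)', mu = (1, 0)'

# Response mechanism, unknown to the analyst; here it depends on x.
resp = rng.random(n) < 0.3 + 0.05 * x_U[s]
d_r, x_r = d_s[resp], x_s[resp]

# Information input X: here both components are known population totals.
X = np.array([N, x_U.sum()])

def calibration_weights(d_r, x_r, X):
    """w_k = d_k {1 + (X - sum_r d_k x_k)' (sum_r d_k x_k x_k')^{-1} x_k}."""
    T = (x_r * d_r[:, None]).T @ x_r          # sum_r d_k x_k x_k'
    lam = np.linalg.solve(T, X - d_r @ x_r)   # T^{-1} (X - sum_r d_k x_k)
    return d_r * (1.0 + x_r @ lam)

w = calibration_weights(d_r, x_r, X)

# Equivalent form under (2.3): w_k = d_k v_k with v_k = X' T^{-1} x_k.
T = (x_r * d_r[:, None]).T @ x_r
v = x_r @ np.linalg.solve(T, X)
```

Here `w @ x_r` reproduces `X`, and `d_r * v` agrees with `w`, which is the simplification that condition (2.3) makes possible.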

A closely related calibration estimator is based on the
same two-tiered vector $\mathbf{x}_k$ but with calibration only to the
sample level:

$$\tilde{Y}_{\mathrm{CAL}} = \sum_r d_k m_k y_k \qquad (2.5)$$

where

$$m_k = \Big(\sum_s d_k \mathbf{x}_k\Big)'\Big(\sum_r d_k \mathbf{x}_k \mathbf{x}_k'\Big)^{-1}\mathbf{x}_k. \qquad (2.6)$$

The calibration equation then reads $\sum_r d_k m_k \mathbf{x}_k = \sum_s d_k \mathbf{x}_k$,
where $\mathbf{x}_k$ has the two components as in (2.2). The
auxiliary vector $\mathbf{x}_k$ serves two purposes: To achieve a low
variance and a low nonresponse bias. From the variance
perspective alone, $\hat{Y}_{\mathrm{CAL}}$ is usually preferred to $\tilde{Y}_{\mathrm{CAL}}$ because
the former profits from the input of a known population
total $\sum_U \mathbf{x}_k^{*}$. But this paper studies the bias. From that
perspective, we are virtually indifferent between $\hat{Y}_{\mathrm{CAL}}$ and
$\tilde{Y}_{\mathrm{CAL}}$, and we focus on the latter. Under liberal conditions,
the difference between the bias of $N^{-1}\hat{Y}_{\mathrm{CAL}}$ and that of
$N^{-1}\tilde{Y}_{\mathrm{CAL}}$ is of order $n^{-1}$, thereby of little practical
consequence even for modest sample sizes $n$, as discussed
for example in Särndal and Lundström (2005).

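An alternative expression for (2.5) is

$$\tilde{Y}_{\mathrm{CAL}} = \Big(\sum_s d_k \mathbf{x}_k\Big)'\mathbf{B}_{\mathbf{x}|r;d} \qquad (2.7)$$

where

$$\mathbf{B}_{\mathbf{x}|r;d} = \Big(\sum_r d_k \mathbf{x}_k \mathbf{x}_k'\Big)^{-1}\sum_r d_k \mathbf{x}_k y_k \qquad (2.8)$$

is the regression coefficient vector arising from the ($d_k$-
weighted) least squares fit based on the data $(y_k, \mathbf{x}_k)$ for
$k \in r$.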
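
The equivalence of the weighting form (2.5) and the regression form (2.7) can be verified on toy data. This is a minimal sketch under invented assumptions (the data-generating model, the logistic response mechanism and all variable names are hypothetical); it only illustrates the algebra.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample s with unequal design weights.
n = 300
d = rng.uniform(10.0, 40.0, size=n)
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # x_k = (1, x_k)'
y = 5.0 + 2.0 * x[:, 1] + rng.normal(size=n)
resp = rng.random(n) < 1.0 / (1.0 + np.exp(-x[:, 1]))   # depends on x

# Only respondent y-values are used below.
d_r, x_r, y_r = d[resp], x[resp], y[resp]

T = (x_r * d_r[:, None]).T @ x_r          # sum_r d_k x_k x_k'
tot_s = d @ x                             # sum_s d_k x_k

# (2.6): adjustment factors m_k for k in r, then (2.5).
m_r = x_r @ np.linalg.solve(T, tot_s)
Y_cal_weights = np.sum(d_r * m_r * y_r)

# (2.8): d-weighted regression coefficients on r, then (2.7).
B = np.linalg.solve(T, (x_r * d_r[:, None]).T @ y_r)
Y_cal_regression = tot_s @ B
```

The two totals agree up to rounding, and the weights `d_r * m_r` reproduce the sample-level totals `tot_s`, which is the calibration equation stated after (2.6).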

A remark on the notation: When needed for emphasis, a
symbol has two indices separated by a semicolon. The first
shows the set of units over which the quantity is computed
and the second indicates the weighting, as in $\mathbf{B}_{\mathbf{x}|r;d}$ given
by (2.8), and in weighted means such as
$\bar{y}_{r;d} = \sum_r d_k y_k / \sum_r d_k$. If the weighting is uniform, the second of
the two indices is dropped as in $\bar{y}_U = (1/N)\sum_U y_k$.


3. Points of reference 


The most primitive choice of vector is the constant one,
$\mathbf{x}_k = 1$ for all $k$. Although inefficient for reducing
nonresponse bias, it serves as a benchmark. Then $m_k = 1/P$
for all $k$, where $P$ is the survey response rate (2.1), and $\tilde{Y}_{\mathrm{CAL}}$
is the expansion estimator:

$$\tilde{Y}_{\mathrm{EXP}} = \frac{1}{P}\sum_r d_k y_k = \hat{N}\,\bar{y}_{r;d} \qquad (3.1)$$

where $\hat{N} = \sum_s d_k$ is design unbiased for the population size
$N$. The bias of $\tilde{Y}_{\mathrm{EXP}}$ can be large.
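
A small worked example of (3.1), using five invented units, may help fix ideas. The numbers and variable names are hypothetical; the only point is that inflating the respondent total by $1/P$ equals the estimated population size times the weighted respondent mean.

```python
# Hypothetical sample of five units; y is observed only for respondents.
d = [10.0, 10.0, 20.0, 20.0, 40.0]           # design weights d_k
responded = [True, False, True, True, False]
y = [3.0, None, 5.0, 4.0, None]

sum_s_d = sum(d)                                           # N_hat = sum_s d_k
sum_r_d = sum(dk for dk, rk in zip(d, responded) if rk)    # sum_r d_k
P = sum_r_d / sum_s_d                                      # response rate (2.1)

sum_r_dy = sum(dk * yk for dk, yk, rk in zip(d, y, responded) if rk)
Y_exp = sum_r_dy / P              # (1/P) * sum_r d_k y_k
ybar_r = sum_r_dy / sum_r_d       # weighted respondent mean
```

With these numbers the response rate is 0.5, so every respondent weight is doubled, and `Y_exp` coincides with `sum_s_d * ybar_r`.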

At the opposite end of the bias spectrum are the
unbiased, or nearly unbiased, estimators obtainable under
full response, when $r = s$. They are hypothetical, not
computable in the presence of nonresponse. Among these
are the GREG estimator with weights calibrated to the
known population total $\sum_U \mathbf{x}_k^{*}$,

$$\hat{Y}_{\mathrm{FUL}} = \sum_s d_k g_k y_k$$

where $g_k = 1 + (\sum_U \mathbf{x}_k^{*} - \sum_s d_k \mathbf{x}_k^{*})'(\sum_s d_k \mathbf{x}_k^{*}\mathbf{x}_k^{*\prime})^{-1}\mathbf{x}_k^{*}$, and
FUL refers to full response. The unbiased HT estimator
(obtained when $g_k = 1$ for all $k$) is

$$\tilde{Y}_{\mathrm{FUL}} = \sum_s d_k y_k. \qquad (3.2)$$

It disregards the information $\sum_U \mathbf{x}_k^{*}$, which may be
important for variance reduction. But for the study of bias in
this paper, we are indifferent between $\hat{Y}_{\mathrm{FUL}}$ and $\tilde{Y}_{\mathrm{FUL}}$. The
difference in bias between the two is of little consequence,
even for modest sample sizes. We can focus on $\tilde{Y}_{\mathrm{FUL}}$.


4. The bias ratio 


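For a given outcome $(s, r)$, consider the estimates
$\tilde{Y}_{\mathrm{CAL}}$, $\tilde{Y}_{\mathrm{EXP}}$ and $\tilde{Y}_{\mathrm{FUL}}$ given by (2.5), (3.1) and (3.2) as three
points on a horizontal axis. Both $\tilde{Y}_{\mathrm{EXP}}$ (generated by the
primitive $\mathbf{x}_k = 1$) and $\tilde{Y}_{\mathrm{CAL}}$ (generated by a better x-vector)
are computable, but biased. As the x-vector improves, $\tilde{Y}_{\mathrm{CAL}}$
will distance itself from $\tilde{Y}_{\mathrm{EXP}}$ and may come near the
unbiased but unrealized ideal $\tilde{Y}_{\mathrm{FUL}}$. We consider therefore
three deviations: $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}$, $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}}$ and $\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}$,
of which only the middle one is computable. The unknown
“deviation total”, $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}$, is decomposable as
“deviation accounted for” (by the chosen x-vector) plus
“deviation remaining”:

$$\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}} = (\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}}) + (\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}). \qquad (4.1)$$

If computable, $\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}$ would be of particular
interest, as an estimate of the bias remaining in $\tilde{Y}_{\mathrm{CAL}}$ (and in
$\hat{Y}_{\mathrm{CAL}}$), whereas $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}$ would estimate the usually
much larger bias of the benchmark, $\tilde{Y}_{\mathrm{EXP}}$. The bias ratio for
a given outcome $(s, r)$ sets the estimated bias of $\tilde{Y}_{\mathrm{CAL}}$ in
relation to that of $\tilde{Y}_{\mathrm{EXP}}$:

$$\text{bias ratio} = \frac{\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}}{\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}}. \qquad (4.2)$$

We scale the three deviations by the estimated population
size $\hat{N} = \sum_s d_k$ and use the notation $\Delta_T = \Delta_A + \Delta_R$, where
$T$ suggests “total”, $A$ “accounted for” and $R$ “remaining”.
Noting that $\sum_r d_k (y_k - \mathbf{x}_k'\mathbf{B}_{\mathbf{x}|r;d}) = 0$, we have

$$\Delta_T = \hat{N}^{-1}(\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}) = \bar{y}_{r;d} - \bar{y}_{s;d}$$

$$\Delta_R = \hat{N}^{-1}(\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}) = \bar{\mathbf{x}}_{s;d}'\mathbf{B}_{\mathbf{x}|r;d} - \bar{y}_{s;d}$$

$$\Delta_A = \hat{N}^{-1}(\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}}) = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_{\mathbf{x}|r;d}$$

where $\bar{\mathbf{x}}_{r;d} = \sum_r d_k \mathbf{x}_k / \sum_r d_k$, $\bar{\mathbf{x}}_{s;d} = \sum_s d_k \mathbf{x}_k / \sum_s d_k$, and
$\bar{y}_{r;d}$ and $\bar{y}_{s;d}$ are the analogously defined means for the
y-variable. Then (4.2) takes the form

$$\text{bias ratio} = \frac{\Delta_R}{\Delta_T} = 1 - \frac{\Delta_A}{\Delta_T} = 1 - \frac{(\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_{\mathbf{x}|r;d}}{\bar{y}_{r;d} - \bar{y}_{s;d}}. \qquad (4.3)$$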

We have bias ratio = 1 for the primitive vector $\mathbf{x}_k = 1$.
Ideally, we want the auxiliary vector $\mathbf{x}_k$ for $\tilde{Y}_{\mathrm{CAL}}$ to give
bias ratio $\approx 0$. For a given outcome $(s, r)$ and a given
y-variable, we take steps in that direction by finding an
x-vector that makes the computable numerator
$\Delta_A = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_{\mathbf{x}|r;d}$ large (in absolute value). This is within our
reach. But whatever our final choice of x-vector, the
remaining bias of $\tilde{Y}_{\mathrm{CAL}}$ is unknown. Even with the best
available x-vector, considerable bias may remain. We have
then attempted to do the best possible, under perhaps
unfavourable circumstances.
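
In a simulation, where $y_k$ is known for the whole sample, the three deviations and the bias ratio (4.3) can all be computed. The sketch below is illustrative only: the data-generating model, the logistic response mechanism and the function name `deviations` are assumptions made here, not taken from the article.

```python
import numpy as np

def deviations(d, x, y, resp):
    """Return (Delta_T, Delta_A, Delta_R) for one outcome (s, r).

    y must be known for all of s, which is possible only in a simulation.
    """
    d_r, x_r, y_r = d[resp], x[resp], y[resp]
    T = (x_r * d_r[:, None]).T @ x_r
    B = np.linalg.solve(T, (x_r * d_r[:, None]).T @ y_r)   # (2.8)
    xbar_r = d_r @ x_r / d_r.sum()
    xbar_s = d @ x / d.sum()
    ybar_r = d_r @ y_r / d_r.sum()
    ybar_s = d @ y / d.sum()
    delta_T = ybar_r - ybar_s
    delta_A = (xbar_r - xbar_s) @ B
    delta_R = xbar_s @ B - ybar_s
    return delta_T, delta_A, delta_R

rng = np.random.default_rng(3)
n = 2000
d = np.full(n, 25.0)
z = rng.normal(size=n)                      # hypothetical auxiliary variable
y = 10.0 + 3.0 * z + rng.normal(size=n)
resp = rng.random(n) < 1.0 / (1.0 + np.exp(-0.8 * z))

x_primitive = np.ones((n, 1))               # x_k = 1
x_better = np.column_stack([np.ones(n), z]) # x_k = (1, z_k)'

results = {}
for name, x in [("primitive", x_primitive), ("better", x_better)]:
    delta_T, delta_A, delta_R = deviations(d, x, y, resp)
    results[name] = {
        "delta_T": delta_T,
        "delta_A": delta_A,
        "delta_R": delta_R,
        "bias_ratio": delta_R / delta_T,
    }
```

In this constructed case the response mechanism depends only on the variable placed in the x-vector, so the bias ratio for the better vector is close to zero; in a real survey the denominator is unknown and no such check is available.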

To summarize, for a given outcome $(s, r)$ and a given
y-variable, the three deviations have the following features: (i)
$\Delta_T = \bar{y}_{r;d} - \bar{y}_{s;d}$ is an unknown constant value, depending
on both unobserved and observed y-values; (ii) $\Delta_A$ is
computable; it depends on $y_k$ for $k \in r$ and on the values
$\mathbf{x}_k$ for $k \in s$ of the chosen x-vector; (iii) $\Delta_R$ cannot be
computed; it depends on unobserved values $y_k$, and on $\mathbf{x}_k$
for $k \in s$.

To follow the progression of the estimates when the
x-vector improves, consider a given outcome $(s, r)$. The
deviation $\Delta_T$ can have either sign. Suppose $\Delta_T > 0$,
indicating a positive bias in $\tilde{Y}_{\mathrm{EXP}}$, as when large units
respond with greater propensity than small ones. When the
x-vector in $\tilde{Y}_{\mathrm{CAL}}$ becomes progressively more powerful by
the inclusion of more and more x-variables, $\Delta_A$ tends to
increase away from zero and will, ideally, come near $\Delta_T$,
indicating a desired closeness of $\tilde{Y}_{\mathrm{CAL}}$ to the unbiased $\tilde{Y}_{\mathrm{FUL}}$.
As long as the x-vector remains relatively weak, $\Delta_A < \Delta_T$ is
likely to hold. When the x-vector becomes increasingly
powerful, $\Delta_A$ moves closer to the fixed $\Delta_T$, a sign of bias
nearing zero. It could even “move beyond”, so that an
“over-adjustment”, $\Delta_A > \Delta_T$, has occurred. This is not a
detrimental feature; although $\Delta_R = \Delta_T - \Delta_A$ is then
negative, it is ordinarily small (the analyst can only work
with $\Delta_A$; it is unknown to him/her whether $\Delta_A$ and $\Delta_T$ are
close, or whether the over-adjustment $\Delta_A > \Delta_T$ has
occurred). These points are illustrated by the simulation in
Section 10. If $\Delta_T < 0$, these tendencies are reversed.

The form of (4.3) may suggest an argument which can
however be misleading: Suppose that a vector $\mathbf{x}_k$ has been
suggested, containing variables thought to be effective,
along with an assumption that $y_k = \boldsymbol{\beta}'\mathbf{x}_k + \varepsilon_k$, where $\varepsilon_k$ is
a small residual. Then $\bar{y}_{r;d} - \bar{y}_{s;d} \approx (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\boldsymbol{\beta} \approx
(\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_{\mathbf{x}|r;d}$ and consequently bias ratio $\approx 0$, sending a
message, often false, that the postulated vector $\mathbf{x}_k$ is
efficient. One weakness of the argument stems from the
well-known fact that nonresponse (unless completely
random) will cause $\mathbf{B}_{\mathbf{x}|r;d}$ to be biased for a regression vector
that describes the y-to-x relationship in the population.
Further comments on this issue are given in Section 8.

Finally, there is the practical consideration that a typical 
survey has many y-variables. To every y-variable corre- 
sponds a calibration estimator, and a bias ratio given by 
(4.3). The ideal x-vector is one that would be capable of 
controlling bias in all those estimators. This is usually not 
possible without compromise, as we discuss later.


5. Expressing the deviation accounted for


The responding unit $k$ receives the weight $d_k m_k$ in the
estimator $\tilde{Y}_{\mathrm{CAL}} = \sum_r d_k m_k y_k$. The nonresponse adjustment
factor $m_k = (\sum_s d_k \mathbf{x}_k)'(\sum_r d_k \mathbf{x}_k \mathbf{x}_k')^{-1}\mathbf{x}_k$ expands the design
weight $d_k$. We can view $m_k$ as the value of a derived
variable, defined for a particular outcome $(s, r)$ and choice
of $\mathbf{x}_k$, independent of all y-variables of interest, and
computable for $k \in s$ (but used in $\tilde{Y}_{\mathrm{CAL}}$ only for
$k \in r$). Using (2.3), we have

$$\sum_r d_k m_k = \sum_s d_k; \qquad \sum_r d_k m_k^2 = \sum_s d_k m_k. \qquad (5.1)$$

Two weighted means are needed:

$$\bar{m}_{r;d} = \frac{\sum_r d_k m_k}{\sum_r d_k} = \frac{1}{P}; \qquad \bar{m}_{s;d} = \frac{\sum_s d_k m_k}{\sum_s d_k} \qquad (5.2)$$

where $P$ is the response rate (2.1). Thus the average
adjustment factor in $\tilde{Y}_{\mathrm{CAL}} = \sum_r d_k m_k y_k$ is $1/P$, regardless
of the choice of x-vector. Whether a chosen x-vector is
efficient or not for reducing bias will depend on higher
moments of the $m_k$. The weighted variance of the $m_k$ is

$$S_m^2 = S_{m|r;d}^2 = \frac{\sum_r d_k (m_k - \bar{m}_{r;d})^2}{\sum_r d_k}. \qquad (5.3)$$

The simpler notation $S_m^2$ will be used. A development of
(5.3) and a use of (5.1) and (5.2) gives

$$S_m^2 = \bar{m}_{r;d}(\bar{m}_{s;d} - \bar{m}_{r;d}). \qquad (5.4)$$

The coefficient of variation of the $m_k$ is

$$cv_m = \frac{S_m}{\bar{m}_{r;d}} = \sqrt{\frac{\bar{m}_{s;d}}{\bar{m}_{r;d}} - 1}. \qquad (5.5)$$

The weighted variance of the study variable $y$ is given by

$$S_y^2 = S_{y|r;d}^2 = \frac{\sum_r d_k (y_k - \bar{y}_{r;d})^2}{\sum_r d_k} \qquad (5.6)$$

(when the response probabilities are not all equal,
$S_y^2 = S_{y|r;d}^2$ is not unbiased for the population variance $S_{yU}^2$, but
this is not an issue for the derivations that follow). We need
the covariance

$$\mathrm{Cov}(y, m) = \mathrm{Cov}(y, m)_{r;d} = \frac{\sum_r d_k (m_k - \bar{m}_{r;d})(y_k - \bar{y}_{r;d})}{\sum_r d_k} \qquad (5.7)$$

and the correlation coefficient, $R_{y,m} = \mathrm{Cov}(y, m)/(S_y S_m)$,
satisfying $-1 \le R_{y,m} \le 1$.

The deviation $\Delta_A = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_{\mathbf{x}|r;d}$ is a crucial
component in the bias ratio (4.3). We seek an x-vector that
makes $\Delta_A$ large. The factors that determine $\Delta_A$ are seen in
(5.8) to (5.10). Computational tools (indicators) to assist the
search for effective x-variables are given in (5.11) and
(5.12). Their derivation by linear algebra is deferred to
Section 7, which may be bypassed by readers more
interested in the practical use of these tools in the search for
x-variables, as illustrated in the empirical Sections 9 and 10.
We can factorize $\Delta_A / S_y$ as

$$\Delta_A / S_y = -R_{y,m} \times cv_m. \qquad (5.8)$$

Two simple multiplicative factors determine $\Delta_A / S_y$:
The coefficient of variation $cv_m$, which is free of $y$, and
computed on the known $\mathbf{x}_k$ alone, and the (positive or
negative) correlation coefficient $R_{y,m}$. Another
factorization in terms of simple concepts is

$$\Delta_A / S_y = F \times R_{y,\mathbf{x}} \times cv_m \qquad (5.9)$$

where $R_{y,\mathbf{x}} = \sqrt{R_{y,\mathbf{x}}^2}$ is the coefficient of multiple
correlation between $y$ and $\mathbf{x}$, $R_{y,\mathbf{x}}^2$ is the proportion of the
y-variance $S_y^2$ explained by the predictor $\mathbf{x}$, and
$F = -R_{y,m}/R_{y,\mathbf{x}}$ (formula (7.8) states the precise expression for
$R_{y,\mathbf{x}}^2$). As Section 7 also shows, $|R_{y,m}| \le R_{y,\mathbf{x}}$ for any
x-vector and y-variable; consequently $-1 \le F \le 1$.
In (5.8) and (5.9), $cv_m$ and $R_{y,\mathbf{x}}$ are non-negative
terms, while $R_{y,m}$ and $F$ can have either sign (or possibly
be zero). Hence

$$|\Delta_A| / S_y = |R_{y,m}| \times cv_m = |F| \times R_{y,\mathbf{x}} \times cv_m. \qquad (5.10)$$

All of $S_y$, $cv_m$, $R_{y,m}$, $R_{y,\mathbf{x}}$ and $F$ are easily computed in
the survey. Both $cv_m$ and $R_{y,\mathbf{x}}$ increase (or possibly stay
unchanged) when further x-variables are added to the
x-vector; $R_{y,m}$ does not have this property.

To illustrate with the aid of fairly typical numbers, if
$F = 0.5$, $R_{y,\mathbf{x}} = 0.6$ and $cv_m = 0.4$, then $\Delta_A / S_y = 0.12$,
implying that $\tilde{Y}_{\mathrm{CAL}}/\hat{N} = \tilde{Y}_{\mathrm{EXP}}/\hat{N} - 0.12 \times S_y$. That is, the
estimated y-mean $\tilde{Y}_{\mathrm{CAL}}/\hat{N}$ has become adjusted by 0.12
standard deviations down from the primitive estimate
$\tilde{Y}_{\mathrm{EXP}}/\hat{N}$. The adjustment can be large compared to the
standard deviation of the estimated y-mean, especially when
the survey sample size is in the thousands. It remains
unknown whether or not that adjustment has cured most of
the biasing effect of nonresponse.
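
The arithmetic of this illustration, and the comparison with the standard deviation of the estimated mean, can be sketched as follows. The respondent count `n_r` and the approximation of that standard deviation by $S_y/\sqrt{n_r}$ are assumptions introduced here for illustration only; they are not stated in the article and ignore design effects.

```python
import math

# Values used in the text.
F, R_yx, cv_m = 0.5, 0.6, 0.4
delta_A_over_S = F * R_yx * cv_m      # (5.9): Delta_A / S_y

# Assumption for illustration only: with n_r respondents and roughly
# simple-random-sampling-like variability, the standard deviation of the
# estimated y-mean is about S_y / sqrt(n_r).
n_r = 4000                            # hypothetical number of respondents
se_over_S = 1.0 / math.sqrt(n_r)

# Size of the adjustment, measured in such standard deviations.
adjustment_in_se_units = delta_A_over_S / se_over_S
```

Under these assumptions an adjustment of 0.12 $S_y$ corresponds to roughly 7.6 standard deviations of the estimated mean, which is the sense in which the adjustment can be large for samples in the thousands.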

It follows from (5.8) that $0 \le |\Delta_A| / S_y \le cv_m$ whatever
the y-variable. A sharper inequality is $|\Delta_A| / S_y \le R_{y,\mathbf{x}} \times cv_m$,
but it depends on the y-variable. Further, if the correlation
ratio $F$ stays roughly constant when the x-vector changes, so
that $F \approx F_0$, then $|\Delta_A| / S_y \approx |F_0| \times R_{y,\mathbf{x}} \times cv_m$.

Although computable for any x-vector and any outcome $(s, r)$, $\Delta_A$ does not reveal the value of the bias ratio. But $\Delta_A$ suggests computational tools, called indicators, for comparing alternative x-vectors. By (5.8), let

$$H_0 = \Delta_A / S_y = -R_{y,m} \times cv_m \qquad (5.11)$$


As borne out by theory in Section 8 and by the empirical work in Section 10, over a long run of outcomes $(s, r)$, the average of $H_0$ tracks the average deviation $\tilde{Y}_{CAL} - Y$ (which measures the bias of $\tilde{Y}_{CAL}$) in a nearly perfect linear manner when the x-vector changes. This holds independently of the response distribution that generates $r$ from $s$. Since $H_0$ can have either sign, it is practical to work with its absolute value, denoted $H_1$; in addition we consider two other indicators, $H_2$ and $H_3$, inspired by (5.9) and (5.10):

$$H_1 = |\Delta_A| / S_y = |R_{y,m}| \times cv_m; \quad H_2 = R_{y,x} \times cv_m; \quad H_3 = cv_m. \qquad (5.12)$$


Our main alternatives are $H_1$ and $H_3$. Of these, $H_1$ is motivated by its direct link to $\Delta_A$, which we want to make large, for a given y-variable. A strong reason to consider $H_3$ is its independence of all y-variables in the survey. The indicator $H_2$ is an ad hoc alternative; although $H_2$ contains a familiar concept, the multiple correlation coefficient $R_{y,x}$, it is less appropriate than $H_1$ because the correlation coefficient ratio $F = -R_{y,m}/R_{y,x}$ may vary considerably from one x-vector to another. Both $H_2$ and $H_3$ increase when further x-variables are added to the x-vector, something which does not hold in general for $H_1$. The use of these indicators is illustrated in the empirical Sections 9 and 10.
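
A minimal sketch of how the four indicators could be computed for one outcome $(s, r)$ is given below. It is an illustration under stated assumptions, not the authors' software: the function names, the small linear solver and the toy data are all hypothetical, the x-vector is assumed to contain a constant, and $m_k$ is taken as $(\sum_s d_k x_k)'(\sum_r d_k x_k x_k')^{-1} x_k$.

```python
def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination (partial pivoting)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def wmean(values, d):
    """Design-weighted mean."""
    return sum(di * vi for di, vi in zip(d, values)) / sum(d)


def indicators(x_s, d_s, x_r, d_r, y_r):
    """Return H0..H3, Delta_A and S_y for sample data (x_s, d_s) and respondent data."""
    p = len(x_r[0])
    A = [[sum(d * x[i] * x[j] for d, x in zip(d_r, x_r)) for j in range(p)]
         for i in range(p)]
    t = [sum(d * x[i] for d, x in zip(d_s, x_s)) for i in range(p)]
    z = solve(A, t)
    m = [sum(zi * xi for zi, xi in zip(z, x)) for x in x_r]      # derived variable m_k
    xy = [sum(d * x[i] * y for d, x, y in zip(d_r, x_r, y_r)) for i in range(p)]
    B = solve(A, xy)                                             # regression of y on x over r
    fit = [sum(bi * xi for bi, xi in zip(B, x)) for x in x_r]
    y_bar, m_bar = wmean(y_r, d_r), wmean(m, d_r)
    S_y = wmean([(y - y_bar) ** 2 for y in y_r], d_r) ** 0.5
    S_m = wmean([(mk - m_bar) ** 2 for mk in m], d_r) ** 0.5
    cov_ym = wmean([(y - y_bar) * (mk - m_bar) for y, mk in zip(y_r, m)], d_r)
    cv_m = S_m / m_bar
    R_ym = cov_ym / (S_y * S_m)
    R_yx = wmean([(f - y_bar) ** 2 for f in fit], d_r) ** 0.5 / S_y
    x_bar_s = [ti / sum(d_s) for ti in t]
    delta_A = y_bar - sum(bi * xi for bi, xi in zip(B, x_bar_s))
    H0 = -R_ym * cv_m
    return {"H0": H0, "H1": abs(H0), "H2": R_yx * cv_m, "H3": cv_m,
            "delta_A": delta_A, "S_y": S_y}


# Toy data (invented): x_k = (1, z_k), equal weights, five respondents out of eight.
x_s = [(1.0, float(z)) for z in range(1, 9)]
d_s = [1.0] * 8
x_r = [(1.0, 3.0), (1.0, 5.0), (1.0, 6.0), (1.0, 7.0), (1.0, 8.0)]
d_r = [1.0] * 5
y_r = [4.0, 6.5, 6.0, 8.0, 9.5]
res = indicators(x_s, d_s, x_r, d_r, y_r)
```

In this toy case the respondents over-represent large values of $z$, so $\Delta_A$ is positive, and `res["H0"]` agrees with `res["delta_A"] / res["S_y"]`, as the factorization (5.8) requires. With a single regressor besides the constant, $H_1$ and $H_2$ coincide.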


6. Preference ranking of auxiliary vectors 


The methods in this paper are intended for use primarily 
with the large samples that characterize government 
surveys. The sample size is ordinarily much larger than the 
dimension of the x-vector. The variance of estimates is 
ordinarily small, compared to the squared bias. However, 
for categorical auxiliary variables, no group size should be 
allowed to be “too small”. It is recommended that all group 
sizes be at least 30, if not at least 50, in order to avoid 
instability. The crossing of categorical variables (to allow 
interactions) implies a certain risk of small groups. It is 
preferable to calibrate on marginal counts, rather than on 
frequencies for small crossed cells. 

In a number of countries, the many available admi- 
nistrative registers provide a rich source of auxiliary 
information, particularly for surveys on individuals and 
households. These registers contain many potential x- 
variables from which to choose. Many different x-vectors 
can be composed. The indicators in (5.12) provide compu- 
tational tools for obtaining a preference ordering, or a 
ranking, of potential x-vectors, with the objective to reduce
as much as possible the bias remaining in the calibration
estimator.


Scenario 1: Focus on a specific y-variable. The bias remaining in the calibration estimator depends on the y-variable; some are more bias prone than others. We identify one specific y-variable deemed to be highly important in the survey, and we seek to identify an x-vector that reduces the bias for this variable as much as possible (if more than one y-variable needs to be taken into account, a compromise must be struck, which suggests Scenario 2 below). For this purpose, we use the y-variable dependent indicator $H_1 = |\Delta_A|/S_y = |R_{y,m}| \times cv_m$ and choose the x-vector so as to make $H_1$ large. An ad hoc alternative is to use the indicator $H_2 = R_{y,x} \times cv_m$, and strive to make it as large as possible.


Scenario 2: The objective is to identify a general purpose x-vector, efficient for all or most y-variables in the survey. This suggests $H_3 = cv_m$ as a compromise indicator, and choosing the x-vector that maximizes $H_3$. To that same effect, Särndal and Lundström (2005, 2008) used the indicator $S_m^2 = H_3^2/P^2$. They showed that the derived variable $m_k$ in (2.6) can be seen as a predictor of the inverse of the unknown response probability and that choosing the x-vector to make $S_m^2$ large signals a bias reduction in the calibration estimator, irrespective of the y-variable.


For each scenario we can distinguish two procedures: 


All vectors procedure: A list of candidate x-vectors is prepared, based on appropriate judgment. We compute the chosen indicator for every candidate x-vector, and settle for the vector that gives the highest indicator value. The resulting x-vector may not be the same for $H_1$ (which targets a specific y-variable) as for $H_3$ (which seeks a compromise for all y-variables in the survey).


Stepwise procedure: There is a pool of available x-variables. We build the x-vector by a stepwise forward (or stepwise backward) selection from among the available x-variables, one variable at a time, using the successive changes (if considered large enough) in the value of the chosen indicator to signal the inclusion (or exclusion) of a given x-variable at a given step. The indicators $H_1$, $H_2$ and $H_3$ do not in general give the same selection of variables. Consider two x-vectors, $x_{1k}$ and $x_{2k}$, such that $x_{2k}$ is made up of $x_{1k}$ and an additional vector $x_{+k}$: $x_{2k} = (x_{1k}', x_{+k}')'$. The transition from $x_{1k}$ to $x_{2k}$ will increase the value of $H_2$ and $H_3$. In each step of a forward selection procedure we select the variable bringing the largest increase in $H_2$ or $H_3$. But the transition does not guarantee an increased value for the most appropriate indicator, $H_1$. However, $H_1$ may be used in stepwise selection in the manner described in Section 9.
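
The stepwise forward procedure can be sketched generically as follows. This is an illustrative sketch, not the authors' implementation: `forward_select`, its `min_gain` stopping threshold and the toy indicator with its weights are hypothetical names and numbers, and any function that maps a set of variables to an indicator value (such as $H_1$ or $H_3$ computed from survey data) could be passed in.

```python
import math


def forward_select(pool, indicator, min_gain=None):
    """Greedy forward selection of x-variables.

    pool      : list of candidate variable names
    indicator : callable mapping a tuple of names to an indicator value
    min_gain  : if given, stop when the best available increase is smaller
    Returns the selection path as a list of (variable, indicator value).
    """
    selected, path = [], []
    current = indicator(tuple(selected))
    remaining = list(pool)
    while remaining:
        # Score every vector obtained by adding one remaining variable.
        scored = [(indicator(tuple(selected + [v])), v) for v in remaining]
        best_value, best_var = max(scored, key=lambda pair: pair[0])
        if min_gain is not None and best_value - current < min_gain:
            break
        selected.append(best_var)
        remaining.remove(best_var)
        path.append((best_var, best_value))
        current = best_value
    return path


# Hypothetical stand-in for an indicator (invented weights, not survey data).
weights = {"income": 9.0, "education": 4.0, "sex": 1.0}


def toy_indicator(names):
    return math.sqrt(sum(weights[v] for v in names))


full_path = forward_select(list(weights), toy_indicator)
short_path = forward_select(list(weights), toy_indicator, min_gain=0.2)
```

With `min_gain=None` every variable enters and the whole path is recorded, which suits an indicator such as $H_1$ whose value may decline in later steps; a positive `min_gain` stops once the increase is no longer considered large enough.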
7. Derivations


For given y-variable and outcome $(s, r)$, we seek an x-vector to make the computable numerator $\Delta_A = (\bar{x}_{r;d} - \bar{x}_{s;d})' B_x$ in the bias ratio (4.3) large, in absolute value. In this section we prove the factorizations $\Delta_A/S_y = -R_{y,m} \times cv_m = F \times R_{y,x} \times cv_m$ in (5.8) and (5.9). We note first that $cv_m^2$ is a quadratic form in the vector that contrasts the x-mean in the response set $r$ with the x-mean in the sample $s$. Let

$$D = \bar{x}_{r;d} - \bar{x}_{s;d}; \quad \Sigma = \sum_r d_k x_k x_k' \Big/ \sum_r d_k \qquad (7.1)$$
Then, with $P$ given by (2.1),

$$cv_m^2 = P^2 \times S_m^2 = D' \Sigma^{-1} D \qquad (7.2)$$


This expression follows from (5.3) and a consequence of (2.3), namely,

$$\bar{x}_{r;d}' \Sigma^{-1} \bar{x}_{r;d} = \bar{x}_{r;d}' \Sigma^{-1} \bar{x}_{s;d} = 1 \qquad (7.3)$$
The vector of covariances with the study variable $y$ is

$$C = \Big( \sum_r d_k (x_k - \bar{x}_{r;d})(y_k - \bar{y}_{r;d}) \Big) \Big/ \Big( \sum_r d_k \Big) \qquad (7.4)$$
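We can then write $\Delta_A$ as a bilinear form:

$$\Delta_A = D' B_x = D' \Sigma^{-1} C \qquad (7.5)$$

using that $D' \Sigma^{-1} \bar{x}_{r;d} = (\bar{x}_{r;d} - \bar{x}_{s;d})' \Sigma^{-1} \bar{x}_{r;d} = 0$ by (7.3).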

A useful perspective on $\Delta_A$ is gained from the geometric interpretation of $C$ and $D$ in (7.5) as vectors in the space whose dimension is that of $x_k$. We have

$$\Delta_A = \Lambda \, (D' \Sigma^{-1} D)^{1/2} \, (C' \Sigma^{-1} C)^{1/2} \qquad (7.6)$$

where

$$\Lambda = \frac{D' \Sigma^{-1} C}{(D' \Sigma^{-1} D)^{1/2} \, (C' \Sigma^{-1} C)^{1/2}}. \qquad (7.7)$$

For a specific y-variable and a specific x-vector, the scalar quantities $(D' \Sigma^{-1} D)^{1/2}$ and $(C' \Sigma^{-1} C)^{1/2}$ represent the respective vector lengths of $D$ and $C$ (following an orthogonal transformation based on the eigenvectors and eigenvalues of $\Sigma^{-1}$). The scalar quantity $\Lambda$ represents the cosine of the angle between $D$ (which is independent of $y$) and $C$ (which depends on $y$); hence $-1 \le \Lambda \le 1$.

When the auxiliary vector $x_k$ is allowed to expand by adding further available x-variables, both vector lengths $(D' \Sigma^{-1} D)^{1/2}$ and $(C' \Sigma^{-1} C)^{1/2}$ increase. The change in the angle $\Lambda$ may be in either direction; if $|\Lambda|$ stays roughly constant, (7.6) shows that $|\Delta_A|$ will increase.
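
The geometric reading of (7.6) and (7.7) can be checked numerically. The sketch below uses invented numbers and, purely to keep the algebra visible, assumes a diagonal $\Sigma$; neither the values nor the helper `inner` come from the paper.

```python
import math

# Hypothetical two-dimensional example (invented numbers).
sigma_diag = [2.0, 0.5]   # diagonal of Sigma, assumed diagonal for simplicity
D = [0.3, -0.2]           # contrast between respondent and sample x-means
C = [1.0, 0.4]            # covariances of the x-variables with y


def inner(u, v):
    """Inner product u' Sigma^{-1} v for the diagonal Sigma above."""
    return sum(ui * vi / s for ui, vi, s in zip(u, v, sigma_diag))


len_D = math.sqrt(inner(D, D))               # length of D in the Sigma^{-1} metric
len_C = math.sqrt(inner(C, C))               # length of C in the Sigma^{-1} metric
cos_angle = inner(D, C) / (len_D * len_C)    # Lambda as in (7.7)
delta_A = inner(D, C)                        # bilinear form as in (7.5)
```

By construction `delta_A` equals `cos_angle * len_D * len_C`, and the Cauchy-Schwarz inequality keeps `cos_angle` between -1 and 1, which is the bound stated for $\Lambda$.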

A second useful perspective on $\Delta_A$ follows by decomposing the total variability of the study variable $y$, $\sum_r d_k (y_k - \bar{y}_{r;d})^2 = (\sum_r d_k) S_y^2$. Two regression fits need to be examined, the one of $y$ on the auxiliary vector $x$, and the one of $y$ on the derived variable $m$ defined by (2.6). To each fit corresponds a decomposition of $S_y^2$ into explained y-variation and residual y-variation. The two explained portions have important links to the bias ratio (4.3). Result 7.1 summarizes the two decompositions.


Result 7.1. For a given survey outcome $(s, r)$, let $D$, $\Sigma$ and $C$ be given by (7.1) and (7.4). Then the proportion of the y-variance $S_y^2$ explained by the regression of $y$ on $x$ is

$$R^2_{y,x} = (C' \Sigma^{-1} C) / S_y^2. \qquad (7.8)$$

The coefficient of correlation between $y$ and the univariate predictor $m$ is

$$R_{y,m} = -(D' \Sigma^{-1} C) / [(D' \Sigma^{-1} D)^{1/2} \times S_y]. \qquad (7.9)$$

Consequently, the proportion of $S_y^2$ explained by $m$ is

$$R^2_{y,m} = (D' \Sigma^{-1} C)^2 / [(D' \Sigma^{-1} D) \times S_y^2]. \qquad (7.10)$$

The proportions $R^2_{y,x}$ and $R^2_{y,m}$ satisfy $R^2_{y,m} \le R^2_{y,x} \le 1$.


Proof. The proof of (7.8) uses the weighted least squares regression of $y$ on $x$ fitted over $r$. The residuals are $y_k - \hat{y}(x)_k$, where $\hat{y}(x)_k = x_k' B_x$ with $B_x$ given by (2.8). The decomposition is

$$\sum_r d_k (y_k - \bar{y}_{r;d})^2 = \sum_r d_k (\hat{y}(x)_k - \bar{y}_{r;d})^2 + \sum_r d_k (y_k - \hat{y}(x)_k)^2.$$

The mixed term is zero. A development of the term "variation explained" gives $\sum_r d_k (\hat{y}(x)_k - \bar{y}_{r;d})^2 = (\sum_r d_k) \, C' \Sigma^{-1} C$. Thus the proportion of variance explained is

$$R^2_{y,x} = \sum_r d_k (\hat{y}(x)_k - \bar{y}_{r;d})^2 \Big/ \Big[ \Big( \sum_r d_k \Big) S_y^2 \Big] = C' \Sigma^{-1} C / S_y^2,$$

as claimed in (7.8). To show (7.9) we note that the covariance (5.7) can be written with the aid of (7.5) as

$$\mathrm{Cov}(y, m) = -\Delta_A / P = -D' \Sigma^{-1} C / P.$$

It then follows from (7.2) that $R_{y,m} = \mathrm{Cov}(y, m)/(S_y S_m)$ has the expression (7.9). The predictions from the regression (with intercept) of $y$ on the univariate explanatory variable $m$ are $\hat{y}(m)_k = \bar{y}_{r;d} + B_m (m_k - \bar{m}_{r;d})$ with $B_m = \mathrm{Cov}(y, m)/S_m^2 = -P (D' \Sigma^{-1} C)/(D' \Sigma^{-1} D)$. The proportion of variance explained is $\sum_r d_k (\hat{y}(m)_k - \bar{y}_{r;d})^2 / [(\sum_r d_k) S_y^2]$, which upon development gives the expression for $R^2_{y,m}$ in (7.10). Finally, $R^2_{y,m} \le R^2_{y,x}$ follows from the Cauchy-Schwarz inequality for a bilinear form: $(D' \Sigma^{-1} C)^2 \le (D' \Sigma^{-1} D)(C' \Sigma^{-1} C)$.

The inequality $R^2_{y,m} \le R^2_{y,x} \le 1$ can also be deduced from the fact that, among all predictions $\hat{y}_k = x_k' \beta$ that are linear in the x-vector, those that maximize the variance explained are $\hat{y}(x)_k = x_k' B_x$, so the predictions $\hat{y}(m)_k$, which are linear in $x_k$ via $m_k$, cannot yield a greater variance explained than that maximum.

Now from (7.9), (7.2) and (7.5), $-R_{y,m} \, cv_m = D' \Sigma^{-1} C / S_y = \Delta_A / S_y$, as claimed by formula (5.8). Moreover, (7.7), (7.8) and (7.9) imply $-R_{y,m}/R_{y,x} = \Lambda$, so the correlation coefficient ratio $F$ in (5.9) equals the angle $\Lambda$ defined by (7.7).


8. Comments: Goodness of fit, properties of the 
bias and a related selection procedure 


Three issues are examined in this section: (i) the relationship between bias and goodness of fit, (ii) the linear relation between the expected value of $\Delta_A = N^{-1}(\hat{Y}_{EXP} - \tilde{Y}_{CAL})$ and the bias of $\tilde{Y}_{CAL}$ or $\hat{Y}_{CAL}$, and (iii) the alternative method for selection of auxiliary variables proposed by Schouten (2007).

For the issue (i), recall that the total deviation in Section 4 is $\Delta_T = \Delta_A + \Delta_R$, where $\Delta_A$ is computable but $\Delta_T$ and $\Delta_R$ are not. If computable, $N \Delta_R = \tilde{Y}_{CAL} - \hat{Y}_{FUL}$ could be an estimate of the bias of $\tilde{Y}_{CAL}$ (and of that of $\hat{Y}_{CAL}$). A small $\Delta_R$ is desirable. The question arises: Is this achieved when $y_k = \beta' x_k + \varepsilon_k$ (with a given vector $x_k$) fits the data well? We need to distinguish two aspects: (a) the computable fit to the data $(y_k, x_k)$ observed for $k \in r$; and (b) the hypothetical fit to the data $(y_k, x_k)$ for $k \in s$, some observed, some not.

A good fit for the respondents, $k \in r$, does not guarantee a small $\Delta_R$: The weighted LSQ fit using the observed data $(y_k, x_k)$ for $k \in r$ gives the residuals $e_{k|r;d} = y_k - x_k' B_{x|r;d}$, computable for $k \in r$, with the property $\sum_r d_k e_{k|r;d} = 0$ (here, the detailed notation $B_{x|r;d}$ specified in (2.8) is preferable to the simplified notation $B_x$). For $k \in s - r$, $e_{k|r;d}$ is not computable; it has an unknown non-zero mean $\bar{e}_{s-r;d} = \sum_{s-r} d_k e_{k|r;d} / \sum_{s-r} d_k$. We have

$$\Delta_R = (\tilde{Y}_{CAL} - \hat{Y}_{FUL})/N = -(1 - P) \, \bar{e}_{s-r;d} \ne 0. \qquad (8.1)$$

Regardless of whether the fit is good (small residuals $e_{k|r;d}$; $R^2_{y,x}$ near one) or poor (large residuals $e_{k|r;d}$; $R^2_{y,x}$ near zero), the deviation $\Delta_R$ given by (8.1) may be large, and $\tilde{Y}_{CAL}$ far from unbiased. Even with a perfect fit for the respondents ($e_{k|r;d} = 0$ for all $k \in r$, and $R^2_{y,x} = 1$), there is no guarantee that the bias is small.

A similar inadequacy affects imputation based on the respondent data. If the regression imputations $\hat{y}_k = x_k' B_{x|r;d}$ are used to fill in for the values $y_k$ missing for $k \in s - r$, the imputed estimator is

$$\hat{Y}_{imp} = \sum_r d_k y_k + \sum_{s-r} d_k \hat{y}_k.$$

Then $\hat{Y}_{imp} = \tilde{Y}_{CAL}$, so $\hat{Y}_{imp}$ has the same exposure to bias as $\tilde{Y}_{CAL}$, as is easily understood: When the nonresponse




causes a skewed selection of y-values, the imputed values 
computed on that skewed selection will misrepresent the 
unknown y-values that characterize the sample s or the 
population U. 

Consider now the aspect (b) of the fit, that is, the hypothetical weighted LSQ regression fit to the data $(y_k, x_k)$ for $k \in s$. The regression coefficient vector would be $B_{x|s;d} = (\sum_s d_k x_k x_k')^{-1} \sum_s d_k x_k y_k$, and the residuals $e_{k|s;d} = y_k - x_k' B_{x|s;d}$ for $k \in s$ satisfy $\sum_s d_k e_{k|s;d} = 0$. Using that $\sum_r d_k m_k x_k / N = \bar{x}_{s;d}$ and $\sum_s d_k y_k / N = \bar{x}_{s;d}' B_{x|s;d}$, we have

$$\Delta_R = N^{-1}(\tilde{Y}_{CAL} - \hat{Y}_{FUL}) = (1/N) \sum_r d_k m_k e_{k|s;d}. \qquad (8.2)$$

Suppose the model is "true for the sample $s$", with a perfect fit, so that $e_{k|s;d} = 0$ for all $k \in s$. Then, by (8.2) we do have $\Delta_R = 0$, so the nonresponse adjusted estimator $\tilde{Y}_{CAL}$ agrees with the unbiased estimator $\hat{Y}_{FUL}$. A belief that the bias is small hinges on an unverifiable assumption.
Turning to the issue (ii), we now explain the essentially linear relation between the bias of $\tilde{Y}_{CAL}$ and the expected value of the indicator $H_0 = \Delta_A/S_y = (\hat{Y}_{EXP} - \tilde{Y}_{CAL})/(N S_y)$. For a given outcome $(s, r)$, a fixed y-variable and a fixed x-vector we have

$$(\tilde{Y}_{CAL} - Y)/(N S_y) = (\hat{Y}_{EXP} - Y)/(N S_y) - H_0.$$

Let $E_{pq}$ denote the expectation operator with respect to all outcomes $(s, r)$, that is, $E_{pq}(\cdot) = E_p(E_q(\cdot \mid s))$, where $p(s)$ and $q(r \mid s)$ are, respectively, the known sampling design and the unknown response distribution. We denote $\mathrm{bias}(\tilde{Y}_{CAL}) = E_{pq}(\tilde{Y}_{CAL}) - Y$, $\mathrm{bias}(\hat{Y}_{EXP}) = E_{pq}(\hat{Y}_{EXP}) - Y$ and $C = E_{pq}(N S_y)$. Using the usual large sample replacement of the expected value of a ratio by the ratio of the expected values, we have $E_{pq}[(\tilde{Y}_{CAL} - Y)/(N S_y)] \approx [E_{pq}(\tilde{Y}_{CAL}) - Y]/E_{pq}(N S_y)$ and analogously for $\hat{Y}_{EXP}$, so

$$\mathrm{bias}(\tilde{Y}_{CAL}) \approx \mathrm{bias}(\hat{Y}_{EXP}) - C \times E_{pq}(H_0). \qquad (8.3)$$


Here $\mathrm{bias}(\hat{Y}_{EXP})$ and $C$ do not depend on the choice of x-vector, whereas $\mathrm{bias}(\tilde{Y}_{CAL})$ and $E_{pq}(H_0)$ do. Therefore, as the x-vector changes, $\mathrm{bias}(\tilde{Y}_{CAL})$ and $E_{pq}(H_0)$ are essentially linearly related. No particular forms of $p(s)$ and $q(r \mid s)$ need to be specified for (8.3) to hold. As a consequence, when two auxiliary vectors, $x_{1k}$ and $x_{2k}$, are compared, the difference in bias is, to close approximation, proportional to the change in the expected value of $H_0$:

$$\mathrm{bias}(\tilde{Y}_{CAL}(x_{1k})) - \mathrm{bias}(\tilde{Y}_{CAL}(x_{2k})) \approx -C (E_1 - E_2) \qquad (8.4)$$

where $E_i = E_{pq}(H_0(x_{ik}))$ for $i = 1, 2$. The properties (8.3) and (8.4) are validated by the Monte Carlo study in Section 10.

Note that formula (8.3) does not guarantee that $\tilde{Y}_{CAL}$ based on a certain vector $x_k$ will have zero or near-zero bias. It does not state that a comparatively large value of $|\Delta_A|$ guarantees a small bias in $\tilde{Y}_{CAL}$. What (8.3) says is that $\mathrm{bias}(\tilde{Y}_{CAL})$ is linearly related to the expectation of the indicator $H_0 = \Delta_A/S_y$. Therefore, to assess available x-vectors in terms of the indicator $H_0$ (or the indicator $H_1 = |\Delta_A|/S_y$) is consistent with the objective of bias reduction.

Turning to the issue (iii), we comment on the alternative method for selection of auxiliary variables proposed by Schouten (2007). His indicator for the step-by-step selection of variables differs from our indicators; it will usually not select exactly the same set of variables. In a list of, say, 30 available categorical x-variables, the first ten to enter will generally not be the same set of ten as with our indicators $H_1$ to $H_3$. The order in which variables are selected will not necessarily be the same either. In some of our empirical work, we compared our selections with the variable selection realized by Schouten's method. In some cases we noted a considerable congruence between the two sets of "first ten" picked in the two procedures.

The differences between the two approaches are best appreciated by a comparison of their background and derivation. Our indicators $H_0$ and $H_1$ originate in the notion of separation (or distance), for a given outcome $(s, r)$, between the adjusted estimator $\tilde{Y}_{CAL}$ and the primitive one, $\hat{Y}_{EXP}$, and in the idea that this separation will ordinarily increase when the x-vector becomes more powerful. The probability sampling design is taken into consideration; no assumptions are made on the response distribution.

Schouten uses a superpopulation argument; sampling 
weights do not appear to enter into consideration. An 
expression for the model-expected bias of an estimator of 
the population mean is found to be proportional to the 
correlation (at the level of the population) between the y- 
variable and the 0-1 indicator for response. It is shown that 
this correlation (and consequently the bias) can be bounded 
inside an interval. In particular, the generalized regression 
estimator is considered and it is shown that its maximum 
absolute bias equals the width of the bias interval. This 
width depends on the true unknown regression vector $\beta$ for the regression (at the population level) between $y$ and $x$. This unknown $\beta$ is replaced by an estimate based on the respondents, thus subject to some bias because of the nonresponse. Schouten emphasizes that a missing-at-random assumption is not needed for his method, which is in that respect similar to our method.


9. Auxiliary variable choice for the Swedish pilot
survey on gaming and problem gambling


We identified a real survey data set to illustrate the use of the indicators $H_1$, $H_2$ and $H_3$ in building the x-vector. In 2008, the Swedish National Institute of Public Health (Svenska Folkhälsoinstitutet) conducted a pilot survey to study the extent of gambling participation and the characteristics of persons with gambling problems. Sampling and weight calibration were carried out by Statistics Sweden. We illustrate the use of the indicators in this survey, for which a stratified simple random sample $s$ of $n = 2,000$ persons was drawn from the Swedish Register of Total Population (RTP). The strata were defined by the cross classification of region of residence by age group. Each of the six regions was defined as a cluster of postal code areas deemed similar in regard to variables such as education level, purchasing power, type of housing and foreign background. The four age groups were defined by the brackets 16-24; 25-34; 35-64 and 65-84.

The overall unweighted response rate was 50.8%. The 
nonresponse, more or less pronounced in the different 
domains of interest, interferes with the accuracy objective. 
An extensive pool of potential auxiliary variables was 
available for this survey, including variables in the RTP, in 
the Education Register and a subset of those in another 
extensive Statistics Sweden data base, LISA. For this 
illustration, we prepared a data file consisting of 13 selected 
categorical variables. Twelve of these were designated as x- 
variables, and one, the dichotomous variable Employed, 
played the role of the study variable. The values of all variables are available for all units $k \in s$. Response ($k \in r$) or not ($k \in s - r$) to the survey is also indicated in the data file.

Variables that are continuous by nature were used as grouped; all 12 x-variables are thus categorical and of the $x_k^{\circ}$ type, as defined in Section 2 (because most of the variables are available for the full population, they are potentially of the type $x_k^{*}$ but since the effect on bias is of little consequence, we used them as $x_k^{\circ}$-variables). The
study variable value, $y_k = 1$ if $k$ is employed and $y_k = 0$ otherwise, is known for $k \in s$, so the unbiased estimate $\hat{Y}_{FUL}$ defined by (3.2) can be computed and used as a reference. We also computed $\hat{Y}_{EXP}$ defined by (3.1), as well as $\tilde{Y}_{CAL}$ defined by (2.5) for different x-vectors built by stepwise selection from the pool of 12 x-variables with the aid of the indicators $H_1$, $H_2$ and $H_3$ defined by (5.12).

We carried out forward selection as follows: The auxiliary vector in Step 0 is the trivial $x_k = 1$, and the estimator is $\hat{Y}_{EXP}$. In Step 1, the indicator value is computed for every one of 12 presumptive auxiliary variables; the variable producing the largest value of the indicator is selected. In Step 2, the indicator value is computed for all 11 vectors of dimension two that contain the variable selected in Step 1 and one of the remaining variables. The variable that gives the largest value for the indicator is selected in Step 2, and so on, in the following steps. A new variable always joins already entered variables in the "side-by-side" (or "+") manner. Interactions are thereby relinquished. The order of selection is different for each indicator.

The values of $H_2$ and $H_3$ that identify the next variable for inclusion are by mathematical necessity increasing in every step. This does not hold for $H_1$. In a certain step $j$, we used the rule to include the x-variable with the largest of the computed $H_1$-values. That value can be smaller than the $H_1$-value that identified the variable entering in the preceding step, $j - 1$. The series of $H_1$-values for inclusion will increase up to a certain step, then begin to decline, as Table 9.1 illustrates.

The unbiased estimate is $\hat{Y}_{FUL}$ = 4,265; the primitive estimate is $\hat{Y}_{EXP}$ = 4,719 (both in thousands). This suggests a large positive bias in $\hat{Y}_{EXP}$, whose relative deviation (in %) is RDF = $(\hat{Y}_{EXP} - \hat{Y}_{FUL})/\hat{Y}_{FUL} \times 10^2$ = 10.7. Adding categorical x-variables one by one into the x-vector will successively change this deviation, although when a few variables have been admitted, the change is not always in the direction of a smaller value. In each step we computed the indicator, $\tilde{Y}_{CAL}$ and RDF = $(\tilde{Y}_{CAL} - \hat{Y}_{FUL})/\hat{Y}_{FUL} \times 10^2$.

Table 9.1 shows the stepwise selection with the indicator $H_1$ (the number of categories is given in parentheses for each selected variable). First to enter is the variable Income class; this brings a large reduction in RDF from 10.7 to 4.5. The next five selections take place with increased $H_1$-values, and the value of RDF is reduced, but by successively smaller amounts. Step six, where Marital status is selected, brings about a turning point, indicated by the double line in Table 9.1: The value of $H_1$ then starts to decline, and $\tilde{Y}_{CAL}$ and RDF start to increase. At step 6, RDF is at its lowest value, 0.5, then starts to rise, illustrating that inclusion of all available x-variables may not be best. The turning point of $H_1$ and the point at which RDF is closest to zero happen to agree in this example. This is not generally the case. Moreover, in a real survey setting, RDF is unknown, as is the step at which RDF is closest to zero.
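
The turning point can be located mechanically from the series of $H_1$-values. The sketch below transcribes the $H_1 \times 10^3$ column of Table 9.1; the helper `turning_point` and the idea of truncating the x-vector at the first decline are illustrative assumptions, not a rule prescribed by the paper.

```python
# H1 x 10^3 by step of entry, transcribed from Table 9.1.
steps = [
    ("Income class", 76), ("Education level", 107),
    ("Presence of children", 114), ("Urban centre dwelling", 118),
    ("Sex", 123), ("Marital status", 125), ("Days unemployed", 121),
    ("Months with sickness benefits", 120), ("Level of debt", 115),
    ("Cluster of postal codes", 109), ("Country of birth", 103),
    ("Age class", 99),
]


def turning_point(path):
    """Return the 1-based step after which the indicator first declines."""
    for j in range(1, len(path)):
        if path[j][1] < path[j - 1][1]:
            return j
    return len(path)


stop = turning_point(steps)
kept = [name for name, _ in steps[:stop]]
```

Here `stop` is 6 and the last variable kept is Marital status, matching the turning point described in the text. As noted there, agreement between this step and the step with the smallest RDF cannot be relied on in general.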

Table 9.2 shows the stepwise selection with indicator $H_3$. Its value increases at every step, but at a rate that levels off, and successive changes in $\tilde{Y}_{CAL}$ become negligible. This suggests stopping after six steps, at which point RDF = 2.8. In none of the 12 steps does RDF come as close to zero as the value RDF = 0.5 obtained with $H_1$ after six steps. In this respect $H_1$ is better than $H_3$ in this example. With all 12 x-variables selected, RDF attains in both tables the final value 2.6.

Table 9.1

Stepwise forward selection, indicator $H_1$, dichotomous study variable Employed. Successive values of $H_1 \times 10^3$, of $\tilde{Y}_{CAL}$ in thousands, and of RDF = $(\tilde{Y}_{CAL} - \hat{Y}_{FUL})/\hat{Y}_{FUL} \times 10^2$. For comparison, $\hat{Y}_{EXP} \times 10^{-3}$ = 4,719; $\hat{Y}_{FUL} \times 10^{-3}$ = 4,265


Auxiliary variable entered           H_1 x 10^3   Y_CAL x 10^-3   RDF
Income class (3)                          76          4,458       4.5
Education level (3)                      107          4,350       2.0
Presence of children (2)                 114          4,326       1.4
Urban centre dwelling (2)                118          4,310       1.1
Sex (2)                                  123          4,296       0.7
Marital status (2)                       125          4,286       0.5
=====================================================================
Days unemployed (3)                      121          4,301       0.9
Months with sickness benefits (3)        120          4,305       1.0
Level of debt (3)                        115          4,322       1.3
Cluster of postal codes (6)              109          4,343       1.8
Country of birth (2)                     103          4,363       2.3
Age class (4)                             99          4,377       2.6


Table 9.2

Stepwise forward selection, indicator $H_3$, dichotomous study variable Employed. Successive values of $H_3 \times 10^3$, of $\tilde{Y}_{CAL}$ in thousands, and of RDF = $(\tilde{Y}_{CAL} - \hat{Y}_{FUL})/\hat{Y}_{FUL} \times 10^2$. For comparison, $\hat{Y}_{EXP} \times 10^{-3}$ = 4,719; $\hat{Y}_{FUL} \times 10^{-3}$ = 4,265


Auxiliary variable entered           H_3 x 10^3   Y_CAL x 10^-3   RDF
Education level (3)                      186          4,520       6.0
Cluster of postcode areas (6)            250          4,505       5.6
Country of birth (2)                     281          4,498       5.5
Income class (3)                         298          4,369       2.4
Age class (4)                            354          4,399       3.1
Sex (2)                                  364          4,384       2.8
Urban centre dwelling (2)                374          4,378       2.6
Level of debt (3)                        381          4,364       2.3
Months with sickness benefits (3)        384          4,380       2.7
Presence of children (2)                 387          4,379       2.7
Marital status (2)                       388          4,379       2.7
Days unemployed (3)                      388          4,377       2.6


The set of the first six variables to enter with $H_1$ has three in common with the corresponding set of six with $H_3$. There is no contradiction in the quite different selection patterns, because $H_1$ is geared to the specific y-variable Employed, while $H_3$ is a compromise indicator, independent of any y-variable. To save space, the step-by-step results for indicator $H_2$ are not shown. Its selection pattern resembles more that of $H_1$ than that of $H_3$. Out of the first six variables to enter with $H_2$, four are among the first six with $H_1$. As a general comment, we believe that in many practical situations the use of more than six variables is unnecessary, and the selection of the first few becomes crucially important.


10. Empirical validation by simulation for a 
constructed population 


The theory presented in earlier sections makes no assumptions on the response distribution. It is unknown. The sampling design is arbitrary; its known inclusion probabilities are taken into account. For the experiment in this section, we specify several different response distributions with a specified positive value for the response probability $\theta_k$ for every $k \in U$. That is, with specified probability $\theta_k$, the value $y_k$ gets recorded in the experiment; with probability $1 - \theta_k$, it goes missing. We find that the indicator $H_0$ (or $H_1 = |H_0|$) defined in (5.11) ranks the different x-vectors in the correct order of preference for all participating response distributions, consistent with the theoretical results (8.3) and (8.4). We confirm that, over a long run of outcomes $(s, r)$, the average of $H_0 = \Delta_A/S_y = -R_{y,m} \times cv_m$ tracks the bias of the calibration estimator, measured by the average of $\hat{Y}_{CAL} - Y$, in an essentially perfectly linear manner, when the x-vector moves through 16 different formulations. We also examine the indicators $H_2$ and $H_3$ defined in (5.12), and find in this experiment that they also have a strong relationship to the bias of $\hat{Y}_{CAL}$.

We experimented with several created populations; the conclusions were similar. We report here results for one constructed population of size $N = 6,000$, with created values $(y_k, x_k, \theta_k)$ for $k = 1, 2, \ldots, N = 6,000$, for 16 alternative categorical formulations of $x_k$, and four different ways to assign the $\theta_k$.

The 16 alternative categorical auxiliary x-vectors were obtained by grouping the generated values $x_{1k}$ and $x_{2k}$ of two continuous auxiliary variables, $x_1$ and $x_2$. The values $(y_k, x_{1k}, x_{2k})$ for $k = 1, 2, \ldots, 6,000$ were created in three steps as follows. Step 1 (the variable $x_1$): The 6,000 values $x_{1k}$ were obtained as independent outcomes of the gamma distributed random variable $\Gamma(a, b)$ with parameter values $a = 2$, $b = 5$. The mean and variance of the 6,000 realized values $x_{1k}$ were 10.0 and 49.9, respectively. Step 2 (the variable $x_2$): For unit $k$, with value $x_{1k}$ fixed by Step 1, a value $x_{2k}$ is realized as an outcome of the gamma random variable with parameters such that the conditional expectation and variance of $x_{2k}$ are $\alpha + \beta x_{1k} + K h(x_{1k})$ and $\sigma^2 x_{1k}$, respectively, where $h(x_{1k}) = x_{1k}(x_{1k} - \mu_{x_1})(x_{1k} - 3\mu_{x_1})$ with $\mu_{x_1} = 10$. We used the values $\alpha = 1$, $\beta = 1$, $K = 0.001$ and $\sigma^2 = 25$. The polynomial term $K h(x_{1k})$ gives a mild non-linear shape to the plot of $(x_{2k}, x_{1k})$, to avoid an exactly linear relationship. The mean and variance of the 6,000 realized values $x_{2k}$ were 11.0 and 210.0, respectively. The correlation coefficient between $x_1$ and $x_2$, computed on the 6,000 couples $(x_{1k}, x_{2k})$, was 0.48. Step 3 (the study variable $y$): For unit $k$, with values $x_{1k}$ and $x_{2k}$ fixed by Steps 1 and 2, a value $y_k$ is realized as an outcome of the gamma random variable with parameters such that the conditional expectation and variance of $y_k$ are $c_0 + c_1 x_{1k} + c_2 x_{2k}$ and $\sigma_0^2 (c_1 x_{1k} + c_2 x_{2k})$, respectively. We used $c_0 = 1$, $c_1 = 0.7$, $c_2 = 0.3$ and $\sigma_0^2 = 2$. The mean and the variance of the 6,000 realized values $y_k$ were 11.4 and 86.5, respectively. The correlation coefficient between $y$ and $x_1$, computed on the 6,000 couples $(y_k, x_{1k})$, was 0.76; that between $y$ and $x_2$, computed on the 6,000 couples $(y_k, x_{2k})$, was 0.73.
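
The three generation steps can be sketched as below. This is an illustrative reconstruction, not the authors' program: the function names and the seed are hypothetical, the parameter values are taken as we read them from the text ($\beta = 1$ in particular should be treated as an assumption), and a gamma variable with a prescribed mean and variance is obtained by setting shape = mean²/variance and scale = variance/mean.

```python
import random


def gamma_with_moments(rng, mean, variance):
    """Draw a gamma variate with the given mean and variance."""
    shape = mean * mean / variance
    scale = variance / mean
    return rng.gammavariate(shape, scale)


def make_population(n_units=6000, seed=1):
    """Create (y, x1, x2) triples following Steps 1 to 3 (assumed parameter values)."""
    rng = random.Random(seed)
    alpha, beta, K, sigma2, mu = 1.0, 1.0, 0.001, 25.0, 10.0
    c0, c1, c2, sigma02 = 1.0, 0.7, 0.3, 2.0
    units = []
    for _ in range(n_units):
        x1 = rng.gammavariate(2.0, 5.0)            # Step 1: shape 2, scale 5
        h = x1 * (x1 - mu) * (x1 - 3.0 * mu)       # cubic term h(x1)
        x2 = gamma_with_moments(rng, alpha + beta * x1 + K * h, sigma2 * x1)  # Step 2
        lin = c1 * x1 + c2 * x2
        y = gamma_with_moments(rng, c0 + lin, sigma02 * lin)                  # Step 3
        units.append((y, x1, x2))
    return units


population = make_population()
mean_x1 = sum(u[1] for u in population) / len(population)
mean_y = sum(u[0] for u in population) / len(population)
```

A different seed gives a different realized population, so the realized means, variances and correlations will only approximately match the figures reported in the text.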

Each of the two x-variables was then transformed into four alternative group modes, denoted 8G, 4G, 2G and 1G, yielding 4 x 4 = 16 different auxiliary vectors $x_k$. The 6,000 values $x_{1k}$ of variable $x_1$ were size ordered; eight equal-sized groups were formed. Group 1 consists of the units with the 750 largest values $x_{1k}$, group 2 consists of the next 750 units in the size ordering, and so on, ending with group 8. In this mode 8G of $x_1$, unit $k$ is assigned the vector value $\gamma_{(x_1:8)k}$ of dimension eight with seven entries "0" and a single entry "1" to code the group membership of $k$. Next, successive group mergers are carried out, so that two adjoining groups always define a new group, every time doubling the group size. Thus for mode 4G, the merger of groups 1 and 2 puts the units with the 1,500 largest $x_{1k}$-values into a first new group; groups 3 and 4 merge to form the second new group of 1,500, and so on; the vector value associated with unit $k$ is $\gamma_{(x_1:4)k}$. In mode 2G, unit $k$ has the vector value $\gamma_{(x_1:2)k} = (1, 0)'$ for the 3,000 largest $x_1$-value units and $\gamma_{(x_1:2)k} = (0, 1)'$ for the rest. In the ultimate mode, 1G, all 6,000 units are put together, all $x_1$-information is relinquished, and $\gamma_{(x_1:1)k} = 1$ for all $k$. The 6,000 values $x_{2k}$ were transformed by the same procedure into the group modes 8G, 4G, 2G and 1G. Corresponding group membership of unit $k$ is coded by the vectors $\gamma_{(x_2:8)k}$, $\gamma_{(x_2:4)k}$, $\gamma_{(x_2:2)k}$ and $\gamma_{(x_2:1)k} = 1$. The 4 x 4 = 16 different auxiliary vectors $x_k$ take into account both kinds of group information; the two $\gamma$-vectors are placed side by side (as opposed to crossed), the result being a calibration on two margins, as indicated by the "+" sign. Thus for the case denoted 8G+8G, unit $k$ has the auxiliary vector value $x_k = (\gamma_{(x_1:8)k}', \gamma_{(x_2:8)k}')'_{(-1)}$, where $(-1)$ indicates that one category is excluded in either $\gamma_{(x_1:8)k}$ or $\gamma_{(x_2:8)k}$ to avoid a singular matrix in the computation, giving $x_k$ the dimension 8 + 8 - 1 = 15. The case 8G+8G has the highest information content. At the other extreme, the case 1G+1G disregards all the x-information and $x_k = 1$ for all $k$. There are 14 intermediate cases of information content. For example, 4G+2G has $x_k = (\gamma_{(x_1:4)k}', \gamma_{(x_2:2)k}')'_{(-1)}$ of dimension 4 + 2 - 1 = 5; 4G+1G has $x_k = (\gamma_{(x_1:4)k}', 1)'_{(-1)} = \gamma_{(x_1:4)k}$ of dimension 4 (there is non-negligible interaction between $x_1$ and $x_2$ in this experiment, but we restrict the experiment to x-vectors without interactions, causing no risk of small group counts).
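
The grouping and the side-by-side coding can be sketched as follows. The function names and the 16 toy values are hypothetical; the choice to drop the last category of the second variable is one way, assumed here, of implementing the $(-1)$ exclusion.

```python
def group_codes(values, n_groups):
    """Code units into equal-sized groups by decreasing value (group 0 = largest)."""
    n = len(values)
    if n % n_groups != 0:
        raise ValueError("number of units must be divisible by n_groups")
    size = n // n_groups
    order = sorted(range(n), key=lambda k: values[k], reverse=True)
    codes = [0] * n
    for rank, k in enumerate(order):
        codes[k] = rank // size
    return codes


def dummies(code, n_groups):
    """0/1 vector of length n_groups with a single 1 at position `code`."""
    return [1 if g == code else 0 for g in range(n_groups)]


def side_by_side(code1, g1, code2, g2):
    """x-vector for the 'g1 G + g2 G' case: two margins, one category dropped."""
    return dummies(code1, g1) + dummies(code2, g2)[:-1]


# Toy illustration with 16 invented values instead of 6,000.
values = [float(v) for v in range(16)]
codes8 = group_codes(values, 8)            # mode 8G
codes4 = [c // 2 for c in codes8]          # mode 4G by merging adjoining groups
x_8g_8g = side_by_side(codes8[0], 8, codes8[0], 8)
```

Merging adjoining groups by integer division reproduces what a direct call `group_codes(values, 4)` gives, and the vector lengths follow the dimension rule in the text: 15 for 8G+8G, 5 for 4G+2G, 4 for 4G+1G and 1 for 1G+1G.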

We discuss here the results for four response distributions. Their response probabilities $\theta_k$, $k = 1, 2, \ldots, N = 6,000$, were specified as follows:

IncExp(10 + $x_1$ + $x_2$), with $\theta_k = 1 - e^{-c(10 + x_{1k} + x_{2k})}$, where $c = 0.04599$
IncExp(10 + $y$), with $\theta_k = 1 - e^{-c(10 + y_k)}$, where $c = 0.06217$
DecExp($x_1$ + $x_2$), with $\theta_k = e^{-c(x_{1k} + x_{2k})}$, where $c = 0.01937$
DecExp($y$), with $\theta_k = e^{-c y_k}$, where $c = 0.03534$.

The constant $c$ was adjusted in all four cases to give a mean response probability of $\bar{\theta}_U = \sum_U \theta_k / N = 0.70$. In the first two, the value 10 (rather than 0) was used to avoid a high incidence of small response probabilities $\theta_k$. These four options represent contrasting features for the response probabilities: increasing as opposed to decreasing, dependent on x-values only as opposed to dependent on y-values only. In the second and fourth option, the response is directly y-variable dependent, and could hence be called "purely non-ignorable".

We generated J = 5,000 outcomes (s, r), where s of size n = 1,000 is drawn from N = 6,000 by simple random sampling and, for every given s, the response set r is realized by each of the four response distributions. That is, for $k \in s$, a Bernoulli trial was carried out with the specified probability $\theta_k$ of inclusion in the response set r. The Bernoulli trials are independent.
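The generation of one outcome (s, r) can be sketched as follows. This is only an illustration under stated assumptions: the auxiliary values are hypothetical uniform draws (the population used in the experiment is not reproduced here), the response distribution is IncExp(10 + $x_1$ + $x_2$) with the constant c quoted above, and the function names are ours.

```python
import math
import random


def inc_exp_prob(c, x1, x2):
    """Response probability theta_k = 1 - exp(-c (10 + x1k + x2k))."""
    return 1.0 - math.exp(-c * (10.0 + x1 + x2))


def draw_outcome(population, n, c, rng):
    """Return (s, r): a simple random sample s of size n and its response set r."""
    # Simple random sampling without replacement from the unit labels.
    s = rng.sample(range(len(population)), n)
    # Independent Bernoulli trial for every sampled unit.
    r = [k for k in s if rng.random() < inc_exp_prob(c, *population[k])]
    return s, r


rng = random.Random(12345)
# Hypothetical auxiliary values (x1k, x2k); NOT the values used in the paper.
population = [(rng.uniform(0, 30), rng.uniform(0, 30)) for _ in range(6000)]
s, r = draw_outcome(population, n=1000, c=0.04599, rng=rng)
```

Repeating `draw_outcome` J times and computing the estimator on each response set gives the Monte Carlo averages described in the next paragraph.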

For each response distribution, for each of the 16 x-vectors, and for every outcome (s, r), we computed the relative deviation $RD = (\hat{Y}_{CAL} - Y)/Y$, where $\hat{Y}_{CAL}$ is given by (2.4) and $Y = \sum_U y_k$ is the targeted y-total, known in this experimental setting (alternatively, we used $\tilde{Y}_{CAL}$ given by (2.5) but, as expected, the difference in bias compared with $\hat{Y}_{CAL}$ is negligible). We also computed the indicators $H_i$, $i = 0, 1, 2, 3$, given by (5.11) and (5.12). Summary measures were computed as

$$relbias = Av(RD) = \frac{1}{J} \sum_{j=1}^{J} RD_j; \qquad Av(H_i) = \frac{1}{J} \sum_{j=1}^{J} H_{ij} \quad \text{for } i = 0, 1, 2, 3,$$

where $j$ indicates the value computed for the $j$th outcome, $j = 1, 2, \ldots, 5{,}000 = J$. For each response distribution, we thus obtain the value relbias (which is the Monte Carlo measure of the relative bias $(E_{pq}(\hat{Y}_{CAL}) - Y)/Y$) and 16 values of $Av(H_i)$ (which is the Monte Carlo measure of $E_{pq}(H_i)$), $i = 0, 1, 2, 3$, where p stands for simple random sampling, and q stands for one of the four response distributions.

Table 10.1 shows, for IncExp(10 + $x_1$ + $x_2$), relbias in % and $Av(H_1) \times 10^3$ for the 16 x-vectors. For the cell 1G + 1G, with vector $x_k = 1$, all four Av-quantities are zero, and relbias is at its highest level, 13.2%. At the opposite extreme, the cell 8G + 8G represents the highest level of information; it gives the highest value for $Av(H_1)$, and relbias is at its lowest value, 0.2%; virtually all bias is removed (except for a possible sign difference, $Av(H_0)$ and $Av(H_1)$ were equal for all cells).

The result (8.4), holding for any response distribution and any sampling design, states that the indicator $H_0$ will rank the $4 \times 4 = 16$ auxiliary vectors correctly for any response distribution (with response probabilities not all constant, as noted below). Table 10.1 illustrates (8.4) in terms of $H_1 = |H_0|$: The change, from any one cell to any other, in the value of $Av(H_1)$ (the Monte Carlo estimate of the expected value of $H_1$) is accompanied by a proportional change in the value of relbias. The same proportionality was noted for the other three response distributions. We could have chosen other response distributions to illustrate the same property.


Table 10.1

Relbias in % and, within parentheses, the value of $Av(H_1) \times 10^3$ for 16 auxiliary vectors $x_k$. Response distribution IncExp(10 + $x_1$ + $x_2$)


Rows: groups based on $x_{1k}$. Columns: groups based on $x_{2k}$.

      8G 4G 2G 1G
8G O27 “COL Osea Oo) iS (OS) has 1G) 
4G Olsen 98) 60Oi (OG) mn le oman (39) etal 0) 
2G ES OL) IES (8S) ie 5-2 ,9) OSS (2) 
1G AN 70) 0 OF) S46) Ss 2 (0) 


The response distribution with a constant response probability $\theta_k$ for all $k$ is a special case. The calibration estimator $\hat{Y}_{CAL}$ based on any vector $x_k$ then has zero bias (very nearly), and this includes the primitive estimator $\hat{Y}_{EXP}$ with $x_k = 1$. Result 8.3 continues to be valid, stating in that case that $E_{pq}(H_0) \approx bias(\hat{Y}_{CAL}) \approx bias(\hat{Y}_{EXP}) \approx 0$. In the
context of the simulation in this section, if $\theta_k = 0.70$ for all $k$ is taken to be an additional response distribution, Table 10.1 will in all 16 cells show nearly zero values of both relbias in % and $Av(H_1) \times 10^3$, from the weakest cell (1G + 1G) all the way to the cell of the most powerful x-vector (8G + 8G). There is no bias to be removed by an improvement of the x-vector. If in practice the indicator $H_1$ does not react to an enlargement of the x-vector, there is no incentive to seek beyond the simplest vector formulation. It could signify one of three possibilities: the y-variable in question is not subject to nonresponse bias, or the response probability is almost constant, or none of the available x-vectors is capable of reducing an existing bias.

To save space we do not show the corresponding tables for $Av(H_2)$ and $Av(H_3)$. By mathematical necessity, both quantities increase in the nested transitions. Not shown either are the counterparts of Table 10.1 for the other three response distributions. The patterns are similar.

Table 10.2 for IncExp(10 + $x_1$ + $x_2$) and Table 10.3 for IncExp(10 + y) show how $Av(H_1)$, $Av(H_2)$ and $Av(H_3)$ rank the 16 x-vectors, represented by their value of relbias. To measure the success of ranking, we computed the Spearman rank correlation coefficient, denoted rancor, between relbias and the value of the indicator, based on the 16 values of each. For $Av(H_1)$, the bottom line of the two tables shows |rancor| = 1, for perfect ranking. For these data, |rancor| is near one also for $Av(H_2)$ and $Av(H_3)$ (more generally, the ranking obtained with $H_2$ and $H_3$ may be good, but is data dependent).
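A minimal sketch of the rank correlation computation is given below. It assumes the usual definition of the Spearman coefficient (Pearson correlation of the ranks, with average ranks for ties); the numbers are hypothetical and are not taken from Tables 10.2 or 10.3, and the function names are ours.

```python
import math


def average_ranks(values):
    """Ranks 1..n in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)


# Hypothetical values for six auxiliary vectors.
relbias = [0.2, 0.9, 1.7, 3.1, 6.0, 13.2]
indicator_a = [101, 95, 88, 79, 55, 0]     # falls whenever relbias rises
indicator_b = [120, 110, 112, 90, 60, 0]   # one adjacent pair out of order

rancor_a = spearman(relbias, indicator_a)
rancor_b = spearman(relbias, indicator_b)
```

A perfectly monotone decreasing relation gives a coefficient of -1, so |rancor| = 1; a single inversion already pulls |rancor| below one.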


Table 10.2

Value, in ascending order, of relbias in %, and corresponding value and rank of $Av(H_1) \times 10^3$, $Av(H_2) \times 10^3$ and $Av(H_3) \times 10^3$, for 16 auxiliary vectors. Bottom line: Value of Spearman rank correlations, rancor. Response distribution IncExp(10 + $x_1$ + $x_2$)


relbias   Av(H_1) x 10^3   Av(H_2) x 10^3   Av(H_3) x 10^3
0.2       101 (1)          17 (1)           232 (1)
0.5       99 (2)           119 (2)          225 (2)
0.5       98 (3)           118 (3)          224 (3)
0.8       96 (4)           109 (4)          ost) (4)
1.3       93 (5)           109 (5)          216 (5)
1.5       91 (6)           105 (6)          213 (6)
1.8       89 (7)           98 (7)           207 (7)
1.9       88 (8)           94 (8)           205 (8)
3.2       78 (9)           80 (11)          192 (9)
3.4       76 (10)          90 (9)           188 (11)
4.1       70 (11)          84 (10)          190 (10)
4.1       70 (12)          77 (12)          175 (13)
5.0       64 (13)          70 (13)          179 (12)
6.4       52 (14)          52 (14)          146 (15)
7.5       46 (15)          46 (15)          156 (14)
13.2      0 (16)           0 (16)           0 (16)
Rancor    -1.00            -0.99            -0.99


There is one notable contrast between the results on relbias for the two response distributions in Tables 10.2 and 10.3. The best among the auxiliary vectors leave considerably more bias for the non-ignorable IncExp(10 + y) than for IncExp(10 + $x_1$ + $x_2$). This is not unexpected, and it is important to note that considerable bias reduction is obtained for the non-ignorable case as well.

In the simulation, the over-adjustment mentioned in Section 4, $\Delta_A > \Delta_T > 0$ (when $\hat{Y}_{EXP}$ has positive bias) or $\Delta_A < \Delta_T < 0$ (when $\hat{Y}_{EXP}$ has negative bias), happens for some outcomes (s, r). The frequency varies with the strength of the auxiliary vector and is different for different response distributions. The cell for which this over-adjustment is most likely to occur is 8G + 8G, the most powerful of the 16 auxiliary vectors. For IncExp(10 + $x_1$ + $x_2$), the bias is almost completely removed for cell 8G + 8G; relbias is only 0.2%. Hence $\hat{Y}_{CAL}$ is close to the unbiased $\hat{Y}_{FUL}$, $\Delta_A$ is near $\Delta_T$, and $\Delta_A > \Delta_T$ happened for 45.6% of all outcomes (s, r). By contrast, for the non-ignorable case IncExp(10 + y), the incidence of $\Delta_A > \Delta_T$ was only 0.1% for the cell 8G + 8G. Although that cell brings considerable bias reduction (compared to the primitive 1G + 1G), there is bias remaining, and as a consequence, $\Delta_A > \Delta_T$ almost never happens.

We do not show the corresponding tables for 
DecExp(x,+x,) and DecExp(y). The lowest value of 
rancor was 0.94, recorded for Av(H,) in the case of 
DecExp( x, + x, ). 

A question not addressed in Tables 10.2 and 10.3 is: How often, over a long series of outcomes (s, r), does a given indicator $H(x_k)$ succeed in pointing correctly to the preferred x-vector? To answer this, let $x_{1k}$ and $x_{2k}$ be two vectors selected for comparison. If the absolute value of the bias of $\hat{Y}_{CAL}(x_{1k})$ is smaller than that of $\hat{Y}_{CAL}(x_{2k})$, we would like to see that $H(x_{1k}) \geq H(x_{2k})$ holds for a vast majority of all outcomes (s, r), because then the indicator $H(\cdot)$ delivers with high probability the correct decision to prefer $x_{1k}$. Because $H(x_k)$ has sampling variability, its success rate (the rate of correct indication) depends on the sample size, and we expect it to increase with sample size.
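The success rate just described reduces to a simple count over outcomes. The sketch below assumes that the indicator values for the two competing vectors have already been computed for each outcome (s, r); the function name and the toy numbers are hypothetical.

```python
def success_rate(h_preferred, h_other):
    """Proportion of outcomes in which the indicator correctly favours the
    vector with the smaller absolute bias, i.e. H(preferred) >= H(other)."""
    if len(h_preferred) != len(h_other):
        raise ValueError("both vectors need one indicator value per outcome")
    hits = sum(1 for a, b in zip(h_preferred, h_other) if a >= b)
    return hits / len(h_preferred)


# Toy indicator values for four outcomes (hypothetical numbers).
h_x1 = [0.101, 0.097, 0.104, 0.093]
h_x2 = [0.099, 0.098, 0.100, 0.090]
rate = success_rate(h_x1, h_x2)
```

Here the indicator points to the preferred vector in three of the four outcomes, a success rate of 75%.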


Table 10.3

Value, in ascending order, of relbias in %, and corresponding value and rank of $Av(H_1) \times 10^3$, $Av(H_2) \times 10^3$ and $Av(H_3) \times 10^3$, for 16 auxiliary vectors. Bottom line: Value of Spearman rank correlations, rancor. Response distribution IncExp(10 + y)


relbias   Av(H_1) x 10^3   Av(H_2) x 10^3   Av(H_3) x 10^3
3.6       74 (1)           91 (1)           165 (1)
3.9       a (2)            84 (2)           158 (2)
4.0       71 (3)           83 (3)           156 (3)
4.3       68 (4)           76 (5)           149 (5)
4.4       68 (5)           78 (4)           153 (4)
4.9       64 (6)           68 (8)           142 (8)
4.9       63 (7)           72 (6)           146 (6)
5.3       60 (8)           69 (7)           143 (7)
5.4       60 (9)           64 (9)           137 (9)
6.0       55 (10)          59 (10)          132 (10)
6.2       53 (11)          54 (11)          128 (11)
g2        46 (12)          54 (12)          122 (12)
7.9       41 (13)          41 (14)          111 (13)
7.9       40 (14)          43 (13)          109 (14)
9.6       27 (15)          og (15)          90 (15)
13.1      0 (16)           0 (16)           0 (16)
Rancor    -1.00            -0.99            -0.99


We threw some light on this question by extending the Monte Carlo experiment: 5,000 outcomes (s, r) were realized, first with sample size n = 1,000, then with sample size n = 2,000 (the response set r is realized according to one of the four response distributions, declaring unit $k$ "responding" as a result of a Bernoulli trial with the specified probability $\theta_k$). We computed the success rate as the proportion of all outcomes (s, r) in which the correct indication materializes in a confrontation of two different x-vectors. Several pairwise comparisons of this kind were carried out. Typical results are shown in Table 10.4, for IncExp(10 + $x_1$ + $x_2$). The upper entry in a table cell shows the success rate in % for n = 1,000, the lower entry shows that rate for n = 2,000. Shown in parentheses is the value of relbias for the vectors in question.

"Severe tests" are preferred, that is, confrontations of vectors with a small difference in absolute relbias, because the correct decision is then harder to obtain. There is a priori no reason why one of the indicators should always outperform the others in this study. In the five severe tests in Table 10.4, $H_1$ has, on the whole, better success rates than $H_2$ and $H_3$. The success rate of $H_1$ improves by doubling the sample size, and tends as expected to be greater when the relbias values are further apart. The case 4G + 8G vs. 8G + 8G compares nested x-vectors, so it is known beforehand that $H_2$ and $H_3$ give perfect success rates.


Table 10.4

Selected pairwise comparisons of auxiliary vectors; percentage of outcomes with correct indication, for the indicators $H_1$, $H_2$ and $H_3$. Within parentheses, relbias in %. Upper entry: n = 1,000; lower entry: n = 2,000. Response distribution IncExp(10 + $x_1$ + $x_2$)


Cells compared         Percent outcomes with correct indication
                       H_1      H_2      H_3
4G + 8G (0.5) vs.      90.0     100.0    100.0
8G + 8G (0.2)          96.4     100.0    100.0
4G + 2G (1.8) vs.      66.8     86.0     ZOg
2G + 8G (1.5)          74.2     89.0     67.4
1G + 8G (4.1) vs.      74.3     70.3     45.0
8G + 1G (3.4)          82.8     78.0     43.3
4G + 1G (4.1) vs.      90.6     61.4     83.9
Ga GCo)                97.0     68.8     92.3
1G + 2G (7.3) vs.      77.4     77.4     34.5
2G + 1G (6.5)          85.9     85.9     28.8


11. Concluding remarks 


In this article, we address survey situations where many alternative auxiliary vectors (x-vectors) can be created and considered for use in the calibration estimator $\hat{Y}_{CAL}$. For any given x-vector, a certain unknown bias remains in $\hat{Y}_{CAL}$; we wish by an appropriate choice of x-vector to make that bias as small as possible. Hence we examine the bias ratio defined by (4.2) and (4.3). The component $\Delta_A$ of the bias ratio was expressed, in (5.8) to (5.10), as a product of easily interpreted statistical measures. This led us to suggest several alternative bias indicators, for use in evaluating different x-vectors in regard to their capacity to effectively reduce the bias. We studied in particular the indicator $H_1$ given by (5.12). It functions very well but is geared to a particular study variable y. However, a typical government survey has many study variables, and for practical reasons it is desirable to use the same x-vector in estimating all y-totals. A compromise becomes necessary. We argued that the indicator $H_3$ in (5.12) suits this purpose; it depends on the $x_k$ but not on any y-data. A topic for further research is to develop other indicators (than $H_3$) for the "many y-variable situation". Another topic for further work is to examine algorithms for stepwise selection of x-variables with the indicator $H_3$, other than the one used in Section 9.


Calibration estimation using exponential tilting in sample surveys


Jae Kwang Kim


Abstract 


We consider the problem of parameter estimation with auxiliary information, where the auxiliary information takes the form 
of known moments. Calibration estimation is a typical example of using the moment conditions in sample surveys. Given 
the parametric form of the original distribution of the sample observations, we use the estimated importance sampling of 
Henmi, Yoshida and Eguchi (2007) to obtain an improved estimator. If we use the normal density to compute the 
importance weights, the resulting estimator takes the form of the one-step exponential tilting estimator. The proposed 
exponential tilting estimator is shown to be asymptotically equivalent to the regression estimator, but it avoids extreme 
weights and has some computational advantages over the empirical likelihood estimator. Variance estimation is also 
discussed and results from a limited simulation study are presented. 


Key Words: Benchmarking estimator; Empirical likelihood; Instrumental variable calibration; Importance sampling; 


Regression estimator. 


1. Introduction 


Consider the problem of estimating $Y = \sum_{i=1}^{N} y_i$ for a finite population of size $N$. Let $A$ denote the index set of the sample obtained by a probability sampling scheme. In addition to $y_i$, suppose that we also observe a $p$-dimensional auxiliary vector $x_i$ in the sample such that $X = \sum_{i=1}^{N} x_i$ is known from an external source. We are interested in estimating $Y$ using the auxiliary information $X$.

The Horvitz-Thompson (HT) estimator of the form

$$\hat{Y}_d = \sum_{i \in A} d_i y_i, \tag{1}$$

where $d_i = 1/\pi_i$ is the design weight and $\pi_i$ is the first order inclusion probability, is unbiased for $Y$. But it does not make use of the information given by $X$. According to Kott (2006), a calibration estimator can be defined as the estimator of the form

$$\hat{Y}_w = \sum_{i \in A} w_i y_i,$$

where the weights $w_i$ satisfy

$$\sum_{i \in A} w_i x_i = X \tag{2}$$

and $\hat{Y}_w$ is asymptotically design unbiased (ADU). Calibration estimation has become very popular in survey sampling because it provides consistency across different surveys and often improves the efficiency (Särndal 2007).
The regression estimator, using the weights

$$w_i = d_i + (X - \hat{X}_d)' \Big( \sum_{j \in A} d_j x_j x_j' \Big)^{-1} d_i x_i, \tag{3}$$

obtained by minimizing

$$\sum_{i \in A} (w_i - d_i)^2 / d_i$$

subject to constraint (2), is asymptotically design unbiased. Note that if an intercept term is included in the column space of the X matrix then (2) implies that the population size $N$ is known. If $N$ is unknown, one can require that the sum of the final weights be equal to the sum of the design weights. Thus,

$$\sum_{i \in A} w_i = \hat{N}, \tag{4}$$

where

$$\hat{N} = \begin{cases} N & \text{if } N \text{ is known} \\ \sum_{i \in A} d_i & \text{otherwise,} \end{cases}$$

can be imposed as a constraint in addition to (2), which yields the weights

$$w_i = \frac{\hat{N}}{\hat{N}_d} d_i + \Big( X - \frac{\hat{N}}{\hat{N}_d} \hat{X}_d \Big)' \Big\{ \sum_{j \in A} d_j (x_j - \bar{X}_d)(x_j - \bar{X}_d)' \Big\}^{-1} d_i (x_i - \bar{X}_d), \tag{5}$$

where $\hat{X}_d = \sum_{i \in A} d_i x_i$, $\hat{N}_d = \sum_{i \in A} d_i$ and $\bar{X}_d = \hat{X}_d / \hat{N}_d$. We define the regression estimator to be $\hat{Y}_{reg} = \sum_{i \in A} w_i y_i$ using the weights (5). The regression estimator can be efficient if $y_i$ is linearly related with $x_i$ (Isaki and Fuller 1982; Fuller 2002), but the weights in the regression estimator can take negative or extremely large values.

The empirical likelihood (EL) calibration estimator, discussed by Chen and Qin (1993), Chen and Sitter (1999), Wu and Rao (2006), and Kim (2009), is obtained by maximizing the pseudo empirical likelihood

$$\sum_{i \in A} d_i \ln(w_i)$$

subject to constraints (2) and (4). The solution to the optimization problem can be written as

$$w_i = d_i \frac{1}{\lambda_0 + \lambda_1'(x_i - X/N)}, \tag{6}$$

where $\lambda_0$ and $\lambda_1$ satisfy constraints (2), (4), and $w_i > 0$ for all $i$. The EL calibration estimator is asymptotically equivalent to the regression estimator using weights (5) and avoids negative weights if a solution exists, but can result in extremely large weights.

Because the empirical likelihood method requires solving nonlinear equations, the computation can be cumbersome. Furthermore, in some extreme cases, $\bar{X}_N = N^{-1} \sum_{i=1}^{N} x_i$ does not belong to the convex hull of the sample $x_i$'s and the solution does not exist. In this extreme situation, the constraint (2) can be relaxed.
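To make the remark about negative regression weights concrete, the following sketch evaluates the weights (5) for a scalar auxiliary variable. The data are hypothetical (four sampled units with equal design weights, and a known total chosen far from the sample-based estimate), and the function name is ours.

```python
def regression_weights(d, x, x_total, n_known=None):
    """Regression weights (5) for a scalar auxiliary variable.

    d: design weights, x: sample values, x_total: known population total X,
    n_known: population size N if known, otherwise None (then N-hat = sum of d).
    """
    n_hat_d = sum(d)
    n_hat = n_known if n_known is not None else n_hat_d
    x_hat_d = sum(di * xi for di, xi in zip(d, x))
    x_bar_d = x_hat_d / n_hat_d
    s = sum(di * (xi - x_bar_d) ** 2 for di, xi in zip(d, x))
    ratio = n_hat / n_hat_d
    slope = (x_total - ratio * x_hat_d) / s
    return [ratio * di + slope * di * (xi - x_bar_d) for di, xi in zip(d, x)]


# Hypothetical sample: N = 100, n = 4, known total X = 750 (population mean 7.5).
d = [25.0, 25.0, 25.0, 25.0]
x = [2.0, 4.0, 6.0, 8.0]
w = regression_weights(d, x, x_total=750.0, n_known=100.0)
# w is [-12.5, 12.5, 37.5, 62.5]: both constraints hold, but one weight is negative.
```

Both calibration constraints (2) and (4) are met exactly, yet the unit with the smallest x-value receives a negative weight because the sample mean (5.0) lies far below the population mean (7.5).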

Rao and Singh (1997) solved a similar problem by allowing

$$\Big| \sum_{i \in A} w_i x_{ij} - X_j \Big| \leq \delta_j X_j, \quad j = 1, 2, \ldots, p,$$

for some small tolerance level $\delta_j > 0$, where $X_j$ is the $j$th element of $X$. Note that the choice of $\delta_j = 0$ leads to the exact calibration condition (2). Rao and Singh (1997) chose the tolerance level $\delta_j$ using a shrinkage factor in the ridge regression but their approach does not directly apply to the empirical likelihood method and the choice of $\delta_j$ is somewhat unclear. Chambers (1996) and Beaumont and Bocci (2008) also discussed a ridge regression estimation in the context of avoiding extreme weights. Breidt, Claeskens and Opsomer (2005) used a penalized spline approach to obtain the ridge calibration. Recently, Park and Fuller (2009) developed a method of obtaining the shrinkage factor $\delta_j$ using a regression superpopulation model with random components.

Chen, Variyath and Abraham (2008) tackled a similar problem in the context of the empirical likelihood method and proposed a solution by adding an artificial point such that $\bar{X}_N = N^{-1} \sum_{i=1}^{N} x_i$ would belong to the convex hull of the augmented $x_i$'s. The proposed estimator in Chen et al. (2008) only satisfies the calibration property approximately in the sense that

$$\sum_{i \in A} w_i x_i - X = o_p(n^{-1/2} N). \tag{7}$$

This approximate calibration property is attractive because it allows more generality in the choice of weights. In particular, when the dimension of the auxiliary variable $x$ is large the calibration constraint (2) can be quite restrictive. As can be seen in Section 2, an estimator satisfying the asymptotic calibration property (7) enjoys most of the desirable properties of the empirical likelihood calibration estimator and is computationally efficient.

In this paper, we consider a class of empirical-likelihood- 
type estimators that satisfy the approximate calibration 
property (7). In Section 2, the idea of estimated importance
sampling of Henmi et al. (2007) is discussed and a new
estimator using this methodology is proposed. In Section 3,
a weight trimming technique to avoid extreme calibration 
weights is proposed. In Section 4, variance estimation of the 
proposed estimator is discussed. In Section 5, results from a 
simulation study are presented. Concluding remarks are 
made in Section 6. 


2. Proposed method 


To introduce the proposed method, we first discuss estimated importance sampling introduced by Henmi et al. (2007). Suppose that $x_i$ is observed throughout the population but $y_i$ is observed only in the sample. We assume a superpopulation model for $x_i$ with density $f(x; \eta)$ known up to a parameter $\eta \in \Omega$. The superpopulation model characterized by the density $f(x; \eta)$ is a working model in the sense that the model is used to derive a model-assisted estimator (Särndal, Swensson and Wretman 1992).

Let $\hat{\eta}$ be the pseudo maximum likelihood estimator of $\eta$ computed from the sample

$$\hat{\eta} = \arg\max_{\eta \in \Omega} \sum_{i \in A} d_i \ln\{f(x_i; \eta)\}$$

and let $\hat{\eta}_{0N}$ be the maximum likelihood estimator of $\eta$ computed from the population

$$\hat{\eta}_{0N} = \arg\max_{\eta \in \Omega} \sum_{i=1}^{N} \ln\{f(x_i; \eta)\}.$$

Following Henmi et al. (2007), we can construct the following estimated importance weight

$$w_i = d_i \frac{f(x_i; \hat{\eta}_{0N})}{f(x_i; \hat{\eta})}. \tag{8}$$

To discuss the asymptotic properties of the estimator using the weights in (8), assume a sequence of the finite populations and the samples, as in Isaki and Fuller (1982), such that

$$\sum_{i \in A} d_i (x_i', y_i)' - \sum_{i=1}^{N} (x_i', y_i)' = O_p(n^{-1/2} N)$$

for all possible $A$ and for each $N$. The following theorem presents some asymptotic properties of the estimator with the estimated importance weights in (8).


Theorem 1. Under the regularity conditions given in Appendix A, the estimator $\hat{Y}_w = \sum_{i \in A} w_i y_i$, with the $w_i$ defined by (8), satisfies


$$\sqrt{n} N^{-1} (\hat{Y}_w - \tilde{Y}_w) = o_p(1), \tag{9}$$
where 
¥, =¥,-Xi, By Sow (10) 


y is defined i ue Cy So4= Sid's) SNS ads 
and a SUC NEA NGL ess Je Ea gok Oln f(x,;1)/0n) Si 
and the notation $B^{\otimes 2}$ denotes $BB'$.


The proof of Theorem 1 is presented in Appendix A.
Because Sy = >/1)S,. =0, we can write (10) as 


iv = % ae Pes. (Sov 5 Soy ), 


which is a regression estimator of Y using $,(1,),) as the 
auxiliary variable. Therefore, under regularity conditions, the proposed estimator using estimated importance sampling is asymptotically unbiased and has asymptotic variance no greater than that of the direct estimator $\hat{Y}_d$. Note that the validity of Theorem 1 does not require that the working model $f(x; \eta)$ be true.

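If the density of $x_i$ is a multivariate normal density, then the weights in (8) become

$$w_i = d_i \frac{\phi(x_i; \bar{X}_N, \Sigma_{xx,N})}{\phi(x_i; \bar{X}_d, \hat{\Sigma}_{xx,d})}, \tag{11}$$

where $\bar{X}_d$ is defined after (5), $\hat{\Sigma}_{xx,d} = \sum_{i \in A} d_i (x_i - \bar{X}_d)^{\otimes 2} / \hat{N}_d$, $\Sigma_{xx,N} = \sum_{i=1}^{N} (x_i - \bar{X}_N)^{\otimes 2} / N$, and $\phi(x; \mu, \Sigma)$ is the density of the multivariate normal distribution with mean $\mu$ and variance-covariance matrix $\Sigma$. If $\Sigma_{xx,N}$ is unknown and only $\bar{X}_N$ is available, then we can use

$$w_i = d_i \frac{\phi(x_i; \bar{X}_N, \hat{\Sigma}_{xx,d})}{\phi(x_i; \bar{X}_d, \hat{\Sigma}_{xx,d})}. \tag{12}$$

Tillé (1998) derived weights similar to those in (12) in the context of conditional inclusion probabilities.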

In general, the parametric model for $x_i$ is unknown. Thus, we consider an approximation for the importance weights in (8) using the Kullback-Leibler information criterion for distance. Let $f(x)$ be a given density for $x$ and let $P_0$ be the set of densities that satisfy the calibration constraint. That is,

$$P_0 = \Big\{ f_0(x) : \int f_0(x)\,dx = 1, \ \int x f_0(x)\,dx = \bar{X}_N \Big\}.$$

The optimization problem using Kullback-Leibler distance can be expressed as

$$\min_{f_0 \in P_0} \int f_0(x) \ln \Big\{ \frac{f_0(x)}{f(x)} \Big\}\,dx. \tag{13}$$

The solution to (13) is

$$f_0(x) = f(x) \frac{\exp(\lambda' x)}{E\{\exp(\lambda' x)\}}, \tag{14}$$

where $\lambda$ satisfies $\int x f_0(x)\,dx = \bar{X}_N$. Thus, the estimated importance weights in (8) using the optimal density in (14) can be written

$$w_i = d_i \frac{f_0(x_i)}{f(x_i)} = d_i \exp(\hat{\lambda}_0 + \hat{\lambda}_1' x_i), \tag{15}$$

where $\hat{\lambda}_0$ and $\hat{\lambda}_1$ satisfy constraints (2) and (4). The shift from $f(x)$ to $f_0(x)$ in (14) is called exponential tilting. Thus, an estimator using the weight (15) satisfying the calibration constraints (2) and (4) can be called an exponential tilting (ET) calibration estimator. That is, we define the ET calibration estimator as

$$\hat{Y}_{ET} = \sum_{i \in A} d_i \exp(\hat{\lambda}_0 + \hat{\lambda}_1' x_i)\, y_i, \tag{16}$$

where $\hat{\lambda}_0$ and $\hat{\lambda}_1$ satisfy constraints (2) and (4). Estimators based on exponential tilting have been used in various contexts. For examples, see Efron (1981), Kitamura and Stutzer (1997), and Imbens (2002). When $N$ is known, Folsom (1991) and Deville, Särndal and Sautory (1993) developed the estimator (16) using a very different approach.
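To compute $\hat{\lambda}_0$ and $\hat{\lambda}_1$ in (16), because of the calibration constraints (2) and (4), we need to solve the following estimating equations:

$$\hat{U}_0(\lambda) \equiv \sum_{i \in A} d_i \exp(\lambda_0 + \lambda_1' x_i) - \hat{N} = 0 \tag{17}$$

$$\hat{U}_1(\lambda) \equiv \sum_{i \in A} d_i \exp(\lambda_0 + \lambda_1' x_i)\, x_i - X = 0, \tag{18}$$

where $\lambda' = (\lambda_0, \lambda_1')$. Writing $\hat{U}' = (\hat{U}_0, \hat{U}_1')$, we can use the Newton-type algorithm of the form

$$\hat{\lambda}_{(t+1)} = \hat{\lambda}_{(t)} - \Big\{ \frac{\partial}{\partial \lambda'} \hat{U}(\hat{\lambda}_{(t)}) \Big\}^{-1} \hat{U}(\hat{\lambda}_{(t)})$$

and the solution can be written

$$\hat{\lambda}_{1(t+1)} = \hat{\lambda}_{1(t)} + \Big\{ \sum_{i \in A} w_{i(t)} (x_i - \bar{X}_{w(t)})^{\otimes 2} \Big\}^{-1} \Big\{ X - \sum_{i \in A} w_{i(t)} x_i \Big\}, \tag{19}$$

where $w_{i(t)} = d_i \exp(\hat{\lambda}_{0(t)} + \hat{\lambda}_{1(t)}' x_i)$ and $\bar{X}_{w(t)} = \sum_{i \in A} w_{i(t)} x_i / \sum_{i \in A} w_{i(t)}$, with the initial values $\hat{\lambda}_{1(0)} = 0$. Once $\hat{\lambda}_{1(t+1)}$ is computed by (19), $\hat{\lambda}_{0(t+1)}$ is computed by

$$\exp(\hat{\lambda}_{0(t+1)}) = \frac{\hat{N}}{\sum_{i \in A} d_i \exp(\hat{\lambda}_{1(t+1)}' x_i)}. \tag{20}$$

Note that $w_{i(0)} = d_i \hat{N} / \hat{N}_d$, since $\hat{\lambda}_{1(0)} = 0$. Because $\hat{U}(\lambda)$ is twice continuously differentiable and convex in $\lambda$, the sequence $\hat{\lambda}_{(t)}$ always converges if the solution to $\hat{U}(\lambda) = 0$ exists (Givens and Hoeting 2005). The convergence rate is quadratic in the sense that

$$|\hat{\lambda}_{(t+1)} - \hat{\lambda}| \leq C\, |\hat{\lambda}_{(t)} - \hat{\lambda}|^2$$

for some constant $C$, where $\hat{\lambda} = \lim_{t \to \infty} \hat{\lambda}_{(t)}$.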
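By construction, the $t$-step exponential tilting (ET) estimator, defined by

$$\hat{Y}_{ET(t)} = \sum_{i \in A} d_i \exp(\hat{\lambda}_{0(t)} + \hat{\lambda}_{1(t)}' x_i)\, y_i, \tag{21}$$

where $\hat{\lambda}_{0(t)}$ and $\hat{\lambda}_{1(t)}$ are computed by (19) and (20), satisfies the calibration constraint (2) for sufficiently large $t$. By the recursive form in (19) with $\hat{\lambda}_{1(0)} = 0$, we can write

$$\hat{\lambda}_{1(t)} = \sum_{j=0}^{t-1} \hat{S}_{xx,w(j)}^{-1} (\bar{X}_N - \bar{X}_{w(j)}), \tag{22}$$

where $\bar{X}_N = X/N$ and $\hat{S}_{xx,w(j)} = \sum_{i \in A} w_{i(j)} (x_i - \bar{X}_{w(j)})^{\otimes 2} / N$. Thus, the $t$-step ET estimator (21) can be written as

$$\hat{Y}_{ET(t)} = \hat{N}\, \frac{\sum_{i \in A} d_i \hat{g}_{i(t)} y_i}{\sum_{i \in A} d_i \hat{g}_{i(t)}},$$

where

$$\hat{g}_{i(t)} = \prod_{j=0}^{t-1} \frac{\phi(x_i; \bar{X}_N, \hat{S}_{xx,w(j)})}{\phi(x_i; \bar{X}_{w(j)}, \hat{S}_{xx,w(j)})}.$$

The following theorem presents some asymptotic properties of the exponential tilting estimator.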


Theorem 2. The $t$-step ET estimator (21) based on equations (19) and (20) satisfies

$$\sqrt{n} N^{-1} (\hat{Y}_{ET(t)} - \hat{Y}_{reg}) = o_p(1), \tag{23}$$

for each $t = 1, 2, \ldots$, where $\hat{Y}_{reg}$ is the regression estimator using the regression weight in (5).


The proof of Theorem 2 is presented in Appendix B. Theorem 2 presents the asymptotic equivalence between the $t$-step ET estimator and the regression estimator. Unlike the regression estimator, the weights of the ET estimator are always positive. For sufficiently large $t$, the $t$-step ET estimator satisfies the calibration constraint (2). Deville and Särndal (1992) proved the result (23) for the special case of $t \to \infty$.
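The iteration (19)-(20) is short enough to sketch for a scalar auxiliary variable. The code below is a minimal illustration under stated assumptions, not the author's implementation: the data are hypothetical (four units with equal design weights and a known total of 750), the function name is ours, and no safeguard is included for the case where no solution exists.

```python
import math


def et_weights(d, x, x_total, n_hat, t):
    """ET weights w_i = d_i exp(lam0 + lam1 * x_i) after t Newton steps.

    Scalar version of (19)-(20), started from lam1 = 0.
    """
    lam1 = 0.0
    for step in range(t + 1):
        # (20): choose lam0 so that the weights sum to n_hat.
        raw = [di * math.exp(lam1 * xi) for di, xi in zip(d, x)]
        scale = n_hat / sum(raw)
        w = [scale * ri for ri in raw]
        if step == t:
            return w
        # (19): Newton update of lam1 using the current weights.
        x_bar_w = sum(wi * xi for wi, xi in zip(w, x)) / n_hat
        s_w = sum(wi * (xi - x_bar_w) ** 2 for wi, xi in zip(w, x))
        lam1 += (x_total - n_hat * x_bar_w) / s_w


# Hypothetical sample: N = 100, n = 4, known total X = 750.
d = [25.0, 25.0, 25.0, 25.0]
x = [2.0, 4.0, 6.0, 8.0]
w_one_step = et_weights(d, x, x_total=750.0, n_hat=100.0, t=1)
w_many_steps = et_weights(d, x, x_total=750.0, n_hat=100.0, t=25)
```

In this example the one-step weights are positive and sum to 100 but reproduce the total 750 only approximately, while after enough steps the calibration constraint (2) is met to numerical precision, with all weights still positive.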


Remark 1. The one-step ET estimator, defined by $\hat{Y}_{ET(1)}$, has a closed-form tilting parameter

$$\hat{\lambda}_{1(1)} = \Big\{ \sum_{i \in A} d_i (x_i - \bar{X}_d)^{\otimes 2} / \hat{N}_d \Big\}^{-1} (\bar{X}_N - \bar{X}_d), \tag{24}$$

where $\bar{X}_N = X/N$ and $\bar{X}_d = \hat{X}_d / \hat{N}_d$. By Theorem 2, the one-step ET estimator is asymptotically equivalent to the regression estimator, but the calibration constraint (2) is not necessarily satisfied. Using Theorem 2 applied to $x_i$ instead of $y_i$, the one-step ET estimator can be shown to satisfy the approximate calibration constraint described in (7).


Remark 2. The ET estimator can also be derived by finding the weights that minimize

$$Q(w) = \sum_{i \in A} w_i \ln \Big( \frac{w_i}{d_i} \Big) \tag{25}$$

subject to constraints (2) and (4). The objective function (25) is often called the minimum discrimination function. The minimum value of $Q(w)$ is zero if (4) is the only calibration constraint and is monotonically increasing if additional calibration constraints are imposed.
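The link between Remark 2 and the exponential form (15) can be checked with a standard Lagrangian argument. The sketch below assumes the objective in (25) is $Q(w) = \sum_{i \in A} w_i \ln(w_i/d_i)$; the multipliers $\mu_0$ and $\mu_1$ are our notation.

```latex
\mathcal{L}(w, \mu_0, \mu_1)
  = \sum_{i \in A} w_i \ln\!\left(\frac{w_i}{d_i}\right)
  - \mu_0 \Big( \sum_{i \in A} w_i - \hat{N} \Big)
  - \mu_1' \Big( \sum_{i \in A} w_i x_i - X \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial w_i}
  = \ln\!\left(\frac{w_i}{d_i}\right) + 1 - \mu_0 - \mu_1' x_i = 0
\;\Longrightarrow\;
w_i = d_i \exp\!\big( \mu_0 - 1 + \mu_1' x_i \big).
```

Setting $\lambda_0 = \mu_0 - 1$ and $\lambda_1 = \mu_1$ gives weights of the form (15), with the multipliers determined by constraints (2) and (4).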


3. Instrumental-variable calibration 


We consider some extension of the proposed method in Section 2 to a more general class of ET calibration estimators using instrumental variables. Use of an instrumental variable in calibration estimation has been discussed in Estevao and Särndal (2000) and Kott (2003) in some limited simulations. Let $z_i = z(x_i)$ be an instrumental variable derived from $x_i$, where the function $z(\cdot)$ is to be determined. The instrumental-variable exponential tilting (IVET) estimator using the instrumental variable $z_i$ can be defined as

$$\hat{Y}_{IVET} = \sum_{i \in A} w_i y_i = \sum_{i \in A} d_i \exp(\hat{\lambda}_0 + \hat{\lambda}_1' z_i)\, y_i, \tag{26}$$

where $\hat{\lambda}_0$ and $\hat{\lambda}_1$ are computed from (2) and (4). Note that the IVET estimator (26) is a class of estimators indexed by $z_i$. The instrumental-variable approach defined in (26) provides more flexibility in creating the ET estimator. The choice of $z_i = x_i$ leads to the standard ET estimator in (16) but some transformation $z_i = z(x_i)$ can make the resulting ET estimator in (26) more attractive in practice. The solution to the calibration equations can be obtained iteratively by

$$\hat{\lambda}_{1(t+1)} = \hat{\lambda}_{1(t)} + \Big\{ \sum_{i \in A} w_{i(t)} (x_i - \bar{X}_{w(t)})(z_i - \bar{Z}_{w(t)})' \Big\}^{-1} \Big\{ X - \sum_{i \in A} w_{i(t)} x_i \Big\}, \tag{27}$$

where $w_{i(t)} = d_i \exp(\hat{\lambda}_{0(t)} + \hat{\lambda}_{1(t)}' z_i)$ and $\bar{Z}_{w(t)} = \sum_{i \in A} w_{i(t)} z_i / \sum_{i \in A} w_{i(t)}$, with equation (20) unchanged and $\hat{\lambda}_{1(0)} = 0$.

The IVET estimator (26) is useful in creating the final weights that have less extreme values. Since the final weight in (26) is a function of $z_i$, we can make $g_i = w_i / d_i$ bounded by making $z_i$ bounded. To create bounded $z_i$, we can use a trimmed version of $x_i$, denoted by $z_i = (z_{i1}, z_{i2}, \ldots, z_{ip})$, where

$$z_{ij} = \begin{cases} x_{ij} & \text{if } |x_{ij} - \bar{x}_j| \leq C_j S_j \\ \bar{x}_j + C_j S_j & \text{if } x_{ij} > \bar{x}_j + C_j S_j \\ \bar{x}_j - C_j S_j & \text{if } x_{ij} < \bar{x}_j - C_j S_j, \end{cases} \tag{28}$$

$\bar{x}_j = \hat{N}_d^{-1} \sum_{i \in A} d_i x_{ij}$, $S_j^2 = \hat{N}_d^{-1} \sum_{i \in A} d_i (x_{ij} - \bar{x}_j)^2$, and $C_j$ is a threshold for detecting outliers, for example, $C_j = 3$. Thus, the IVET estimator using the instrumental variable obtained by trimming $x_i$ can be used as an alternative approach to weight trimming.
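The trimming rule (28) and the iteration (27) can be sketched together for a scalar variable. Everything below is illustrative: the five-unit sample, the threshold C = 1 (chosen only so that trimming is visible in such a small sample) and the function names are our assumptions, and the normalization step uses the instrument $z_i$ in place of $x_i$.

```python
import math


def trim(x, d, c):
    """Trimmed instrument (28): clip x to mean +/- c * sd (design-weighted)."""
    n_hat_d = sum(d)
    mean = sum(di * xi for di, xi in zip(d, x)) / n_hat_d
    sd = math.sqrt(sum(di * (xi - mean) ** 2 for di, xi in zip(d, x)) / n_hat_d)
    lower, upper = mean - c * sd, mean + c * sd
    return [min(max(xi, lower), upper) for xi in x]


def ivet_weights(d, x, z, x_total, n_hat, t):
    """IVET weights w_i = d_i exp(lam0 + lam1 * z_i) after t steps of (27)."""
    lam1 = 0.0
    for step in range(t + 1):
        # Normalize so that the weights sum to n_hat (constraint (4)).
        raw = [di * math.exp(lam1 * zi) for di, zi in zip(d, z)]
        scale = n_hat / sum(raw)
        w = [scale * ri for ri in raw]
        if step == t:
            return w
        x_bar = sum(wi * xi for wi, xi in zip(w, x)) / n_hat
        z_bar = sum(wi * zi for wi, zi in zip(w, z)) / n_hat
        cross = sum(wi * (xi - x_bar) * (zi - z_bar)
                    for wi, xi, zi in zip(w, x, z))
        lam1 += (x_total - n_hat * x_bar) / cross


# Hypothetical sample: N = 100, n = 5, known total X = 450 (mean 4.5).
d = [20.0] * 5
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = trim(x, d, c=1.0)   # the extreme values 1 and 5 are pulled towards the mean
w = ivet_weights(d, x, z, x_total=450.0, n_hat=100.0, t=25)
```

The weights are calibrated on $x$ even though the exponent uses the bounded instrument $z$, which is the point of the construction.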
Instead of using the trimmed instrumental variable $z_i$ in (28), we can consider the following instrumental variable

$$z_i = \Phi_i x_i,$$

for some symmetric matrix $\Phi_i$ such that $z_i$ is bounded. Some suitable choice of $\Phi_i$ can also improve the efficiency of the resulting IVET estimator. To see this, using the same argument from Theorem 2, the instrumental-variable ET estimator (26) using equations (20) and (27) is asymptotically equivalent to

$$\hat{Y}_{IV,reg} = \tilde{Y}_d + (X - \tilde{X}_d)' \hat{B}_z, \tag{29}$$

where

$$(\tilde{X}_d, \tilde{Y}_d) = \frac{\hat{N}}{\hat{N}_d} (\hat{X}_d, \hat{Y}_d)$$

and

$$\hat{B}_z = \Big\{ \sum_{i \in A} d_i (z_i - \bar{Z}_d)(x_i - \bar{X}_d)' \Big\}^{-1} \sum_{i \in A} d_i (z_i - \bar{Z}_d)\, y_i. \tag{30}$$

The estimator (29) takes the form of a regression estimator and is called the instrumental-variable regression estimator. Thus, under the choice of $z_i = \Phi_i x_i$, the instrumental-variable regression estimator can be written as (29) with

$$\hat{B}_z = \Big\{ \sum_{i \in A} d_i (x_i - \bar{X}_d) \Phi_i (x_i - \bar{X}_d)' \Big\}^{-1} \sum_{i \in A} d_i (x_i - \bar{X}_d) \Phi_i y_i$$

and its variance is minimized for $\Phi_i = V_i^{-1}$, where $V_i$ is the model-variance of $y_i$ given $x_i$ (Fuller 2009). The model-variance is the variance under the working superpopulation model for the regression of $y_i$ on $x_i$. Thus, an instrumental variable can be used to improve the efficiency of the resulting calibration estimator, in addition to avoiding extreme final weights. Furthermore, the optimal instrumental variable can be trimmed as in (28) to make the final weights bounded. Further investigation of the optimal choice of $\Phi_i$ is beyond the scope of this paper and will be a topic of future research.

Remark 3. Deville and Särndal (1992) also considered range-restricted calibration weights of the form

$$w_i = d_i g_i(\hat{\lambda}) = d_i \frac{L(U - 1) + U(1 - L)\exp(K \hat{\lambda}' x_i)}{(U - 1) + (1 - L)\exp(K \hat{\lambda}' x_i)}, \tag{31}$$

where $K = (U - L)/\{(1 - L)(U - 1)\}$, for some $L$ and $U$ such that $0 < L < 1 < U$. If calibration constraints (2) and (4) are to be satisfied, then we can use $\hat{\lambda}_0 + \hat{\lambda}_1' x_i$ instead of $\hat{\lambda}' x_i$ in (31). The resulting calibration estimator is asymptotically equivalent to the regression estimator using the weights in (5) while the IVET estimator is asymptotically equivalent to the instrumental-variable regression estimator (29). Computation for obtaining $\hat{\lambda}$ is somewhat complicated because $\partial g_i(\lambda)/\partial \lambda$ is not easy to evaluate in (31). In the IVET estimator, the computation, given by (27), is straightforward.


To compare the proposed weights with existing methods,
we consider an artificial example of a simple random
sample of size $n = 5$ where $x_k = k$, $k = 1, 2, \ldots, 5$.
Calculations are for three population means of $x$: $\bar{X}_N = 3$,
$\bar{X}_N = 4.5$, and $\bar{X}_N = 6$. Table 1 presents the resulting
weights for the regression estimator, the empirical likelihood
(EL) estimator, the $t$-step ET estimator (16) with
$t = 1$ and $t = 10$, and the $t$-step instrumental variable
exponential tilting (IVET) estimator (26) with $t = 1$ and
$t = 10$. For the IVET estimator, the instrumental variable $z_k$
is created by

$$z_k = \begin{cases} 1.5 & \text{if } x_k \le 1.5 \\ x_k & \text{if } x_k \in (1.5, 4.5) \\ 4.5 & \text{if } x_k \ge 4.5. \end{cases}$$



The last column of Table 1 presents the estimated mean of
$x$ using the respective calibration weights. All the weights
are equal to $1/n = 0.2$ for $\bar{X}_N = 3$. The regression weights
are linearly increasing in $x_k$ but some of them are negative for the
populations with $\bar{X}_N = 4.5$ and $\bar{X}_N = 6$. For the population
where $\bar{X}_N = 6$, the weights could not be computed for the
EL method because $\bar{X}_N$ is outside the range of the sample
$x_k$'s. In this extreme case of $\bar{X}_N = 6$, the ET method
provides nonnegative weights by sacrificing the calibration
constraint. The EL estimator has more extreme weights
than the ET estimator or IVET estimator in the sense that
the weight for $k = 5$ is the largest among the estimators
considered. The weight for the one-step ET estimator is
close to that of the regression estimator for large $x_k$ but it is
close to that of the EL estimator for small $x_k$. The 10-step ET
estimator has better calibration properties, in the sense of a
smaller value of the squared error $(\sum_{k=1}^{n} w_k x_k - \bar{X}_N)^2$, than the
one-step ET estimator. The ET estimator and the IVET
estimator provide almost the same estimates of $\bar{X}_N$ for both
values of $t$, but the IVET estimator produces less extreme weights
than the ET estimator.
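
The regression and one-step ET rows of this example can be reproduced with a few lines of code. The sketch below is illustrative only. Equation (16) is not reproduced in this section, so the one-step ET weight is assumed to take the form $w_k \propto \exp(\hat{\lambda} x_k)$ with $\hat{\lambda} = (\bar{X}_N - \bar{x})/s_x^2$, where $s_x^2$ is the sample variance with divisor $n$. The function names are hypothetical.

```python
import math


def regression_weights(x, xbar_pop):
    """Regression (GREG-type) weights for a simple random sample, scaled to sum to 1."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xbar_pop - xbar) * (xi - xbar) / sxx for xi in x]


def one_step_et_weights(x, xbar_pop):
    """One-step exponential tilting weights (assumed form, see the lead-in)."""
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / n
    lam = (xbar_pop - xbar) / s2
    # Centring x does not change the normalised weights.
    raw = [math.exp(lam * (xi - xbar)) for xi in x]
    total = sum(raw)
    return [r / total for r in raw]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
reg = regression_weights(x, 4.5)
et1 = one_step_et_weights(x, 4.5)
reg_mean = sum(w * xi for w, xi in zip(reg, x))
et1_mean = sum(w * xi for w, xi in zip(et1, x))
```

Under these assumptions the regression weights hit the calibration target exactly but include a negative value, while the one-step ET weights are all positive and only approximately calibrated.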
Table 1
An example of calibration weights with a sample of size n = 5

                               Weight for x_k = k
Method          X̄_N       1        2        3        4        5     Estimated mean
Reg.            3.0     0.200    0.200    0.200    0.200    0.200        3.0
                4.5    -0.100    0.050    0.200    0.350    0.500        4.5
                6.0    -0.400   -0.100    0.200    0.500    0.800        6.0
EL              3.0     0.200    0.200    0.200    0.200    0.200        3.0
                4.5     0.033    0.043    0.063    0.115    0.746        4.5
                6.0      N/A      N/A      N/A      N/A      N/A         N/A
ET (t=1)        3.0     0.200    0.200    0.200    0.200    0.200        3.0
                4.5     0.027    0.057    0.100    0.255    0.540        4.2
                6.0     0.002    0.009    0.039    0.173    0.777        4.7
ET (t=10)       3.0     0.200    0.200    0.200    0.200    0.200        3.0
                4.5     0.009    0.027    0.078    0.227    0.659        4.5
                6.0     0.000    0.000    0.000    0.001    0.999        5.0
IVET (t=1)      3.0     0.200    0.200    0.200    0.200    0.200        3.0
                4.5     0.030    0.047    0.121    0.309    0.493        4.2
                6.0     0.003    0.006    0.041    0.267    0.683        4.6
IVET (t=10)     3.0     0.200    0.200    0.200    0.200    0.200        3.0
                4.5     0.007    0.015    0.066    0.294    0.618        4.5
                6.0     0.000    0.000    0.000    0.087    0.913        4.9

Reg., regression estimator; EL, empirical likelihood; ET, exponential tilting; IVET, instrumental variable exponential tilting; N/A, not applicable.



4. Variance estimation 


We now discuss variance estimation of the ET calibration
estimators of Sections 2 and 3. Because the estimated
parameter $(\hat{\lambda}_0, \hat{\lambda}_1)$ in the ET calibration estimator (16) has
some sampling variability, the variance estimation method
should take this sampling variability into account.
In this case, variance estimation can
often be carried out by a linearization method or by a
replication method (Wolter 2007). For the discussion of the
linearization method, let the variance of the HT estimator
(1) be consistently estimated by

$$\hat{V}(\hat{Y}_d) = \sum_{i \in A} \sum_{j \in A} \Omega_{ij}\, y_i y_j. \qquad (32)$$

The linearization variance estimator for the ET estimator
can be obtained by the linearization variance formula for the
regression estimator, as in Deville and Särndal (1992), using
the asymptotic equivalence between the ET calibration
estimator and the regression estimator, as shown in Theorem
2. Specifically, if the population size $N$ is known, a
linearization variance estimator of the IVET estimator in
(26) can be written as

$$\hat{V}(\hat{Y}_{IVET}) = \sum_{i \in A} \sum_{j \in A} \Omega_{ij}\, \hat{g}_i \hat{g}_j \hat{e}_i \hat{e}_j, \qquad (33)$$

where $\Omega_{ij}$ are the coefficients of the variance estimator in
(32), $\hat{g}_i = w_i/d_i$ is the weight adjustment factor, and
$\hat{e}_i = y_i - \bar{Y}_d - (x_i - \bar{X}_d)' \hat{B}_z$, where $\hat{B}_z$ is defined in (30).
The choice of $z_i = x_i$ in (33) gives the linearized variance
estimator for the ET estimator in (16). Consistency of the
variance estimator (33) can be found in Kim and Park
(2010).
For the one-step ET estimator, a replication method can
be easily implemented. Let the replication variance
estimator be of the form

$$\hat{V}_{rep} = \sum_{k=1}^{L} c_k \big(\hat{Y}_d^{(k)} - \hat{Y}_d\big)^2, \qquad (34)$$

where $L$ is the number of replicates, $c_k$ is the replication
factor associated with replicate $k$, $\hat{Y}_d^{(k)} = \sum_{i \in A} d_i^{(k)} y_i$, and
$d_i^{(k)}$ is the $k$th replicate of the design weight $d_i$. For
example, the replication variance estimator (34) includes the
jackknife and the bootstrap (see Rust and Rao 1996).
Assume that the replication variance estimator (34) is a
consistent estimator for the variance of $\hat{Y}_d$. The $k$th
replicate of the one-step ET estimator can be computed by

$$\hat{Y}_{ET}^{(k)} = \sum_{i \in A} d_i^{(k)} \exp\big(\hat{\lambda}_0^{(k)} + z_i' \hat{\lambda}_1^{(k)}\big)\, y_i, \qquad (35)$$

where

$$\hat{\lambda}_1^{(k)} = \Big\{\sum_{i \in A} d_i^{(k)} (x_i - \bar{X}^{(k)})(z_i - \bar{Z}^{(k)})' / \hat{N}^{(k)}\Big\}^{-1} \big(X/N - \bar{X}^{(k)}\big),$$

$$\hat{N}^{(k)} = \sum_{i \in A} d_i^{(k)},$$

$$\big(\bar{X}^{(k)}, \bar{Z}^{(k)}\big) = \frac{\sum_{i \in A} d_i^{(k)} (x_i, z_i)}{\sum_{i \in A} d_i^{(k)}},$$

and

$$\exp\big(\hat{\lambda}_0^{(k)}\big) = \frac{N}{\sum_{i \in A} d_i^{(k)} \exp\big(z_i' \hat{\lambda}_1^{(k)}\big)}.$$

The replication variance estimator defined by

$$\hat{V}_{rep} = \sum_{k=1}^{L} c_k \big(\hat{Y}_{ET}^{(k)} - \hat{Y}_{ET}\big)^2, \qquad (36)$$

where $\hat{Y}_{ET}^{(k)}$ is defined in (35), can be used to estimate the
variance of the ET calibration estimator in (26).
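
The replication idea can be sketched for the simplest case. The code below is an illustration under stated assumptions rather than the paper's procedure: it uses a scalar auxiliary variable with $z_i = x_i$, estimates the population mean rather than the total, and uses the delete-one jackknife with $c_k = (n-1)/n$ and replicate weights $d_i n/(n-1)$, ignoring any finite population correction. Both function names are hypothetical.

```python
import math


def one_step_et_mean(x, y, d, xbar_pop):
    """One-step ET estimate of the mean of y, with scalar x and z_i = x_i (assumed form)."""
    total_d = sum(d)
    xbar = sum(di * xi for di, xi in zip(d, x)) / total_d
    s2 = sum(di * (xi - xbar) ** 2 for di, xi in zip(d, x)) / total_d
    lam = (xbar_pop - xbar) / s2
    w = [di * math.exp(lam * (xi - xbar)) for di, xi in zip(d, x)]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def jackknife_variance(x, y, d, xbar_pop):
    """Delete-one jackknife: recompute the whole estimator on each replicate."""
    n = len(x)
    full = one_step_et_mean(x, y, d, xbar_pop)
    total = 0.0
    for k in range(n):
        keep = [i for i in range(n) if i != k]
        rep = one_step_et_mean(
            [x[i] for i in keep],
            [y[i] for i in keep],
            [d[i] * n / (n - 1) for i in keep],  # replicate design weights
            xbar_pop,
        )
        total += (rep - full) ** 2
    return (n - 1) / n * total


x = [1.0, 2.0, 3.0, 4.0, 5.0]
d = [1.0] * 5
v_const = jackknife_variance(x, [2.0] * 5, d, 3.5)
v_vary = jackknife_variance(x, [1.2, 1.9, 3.2, 3.8, 5.1], d, 3.5)
```

The point of the design is that the tilting parameter is re-estimated within every replicate, so its sampling variability is reflected in the variance estimate without any derivative calculations.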


5. Simulation study 


To study the finite sample performance of the proposed 
estimators, we performed a limited simulation study. In the 
simulation, two finite populations of size N = 10,000 were 
independently generated. In population A, the finite popula- 
tion is generated from an infinite population specified by 
eee OC te vo ot ee en el xe NV (0:1); 
21(%. ¥)~x2() +|y;|. In population B, (x, @, z)) 
are the same as in population A but y, =(5—1/V8) + 
1/8 (x, —2)’ +e,. The auxiliary variable, x,, is used for 
calibration and z, is the measure of size used for unequal 
probability sampling. From both of the finite populations 
generated, M =10,000 Monte Carlo samples of size n 
were independently generated under two sampling schemes 
described below. The parameter of interest is the population 
mean of y and we assume that the population size N is 
known. 

The simulation setup can be described as a $2 \times 2 \times 8 \times 2$
factorial design with four factors. The factors are (a) two
types of finite populations, (b) sampling mechanism: simple
random sampling and probability proportional to size ($z_i$)
sampling with replacement, (c) calibration method: no
calibration, the regression estimator, the EL method in (6)
with $t = 1$ and $t = 10$, the $t$-step ET method in (21) with
$t = 1$ and $t = 10$, and the IVET method (26) with $t = 1$ and
$t = 10$, (d) sample size: $n = 100$ and $n = 200$. Since $N$ is
assumed to be known, the calibration estimators are
computed to satisfy $\sum_{i \in A} w_i (1, x_i) = (N, X)$ in both
populations. For the IVET method (26), the instrumental
variable $z_i$ is created using the definition in (28) with
threshold $C = 3$.

Using the Monte Carlo samples generated as above, the 
biases and the mean squared errors of the eight estimators of 
the population mean of y, the variable of interest, were 
computed and are presented in Table 2. The calibration 
estimators are biased but the bias is small if the regression 
model holds or the sample size is large. In population A, the 
linear regression model holds and the regression estimator is 
efficient in terms of mean squared errors. However, the 
regression estimator is not efficient in population B because 
the model used for the regression estimator is not a good fit. 
The seven calibration estimators show similar performances 
for the larger sample size. The 10-step IVET estimator
performs as well as the regression estimator in population
A, and it shows slightly better performance than the other
six calibration estimators. In population B, the 10-step IVET
estimator performs the best among the calibration estimators
considered.

In addition to point estimation, variance estimation was 
also considered. We considered only the variance estimation 
for the $t$-step ET estimators and IVET estimators. The
linearization variance estimator in (33) and the replication 
variance estimator in (36) were computed for each estimator 
in each sample. In the replication method, the jackknife 
method was used by deleting one element for each 
replication. The relative biases of the variance estimators 
were computed by dividing the Monte Carlo bias of the 
variance estimator by the Monte Carlo variance. The Monte 
Carlo relative biases of the linearization variance estimators 
and the replication variance estimators are presented in 
Table 3. The theoretical relative bias of the variance esti- 
mators is of order o(1), which is consistent with the 
simulation results in Table 3. The linearization variance 
estimator slightly underestimates the true variance because 
it ignores the second order term in the Taylor linearization. 
The replication variance estimator shows slight positive bias 
in the simulation. The biases of the variance estimators are 
generally smaller in absolute values in population A because 
the linear model holds. In population B, variance estimators
for the IVET estimator are less biased than those for the ET
estimator because the IVET estimator uses less extreme
weights.
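
The Monte Carlo summaries used in this section are simple to compute once the replicated estimates are stored. The helper below is a sketch with hypothetical names. It follows the definition given in the text, dividing the Monte Carlo bias of the variance estimator by the Monte Carlo variance, and returns the ratio as a fraction (multiply by 100 for percentages).

```python
from statistics import fmean, pvariance


def monte_carlo_summary(estimates, variance_estimates, truth):
    """Bias, MSE and relative bias of the variance estimator over Monte Carlo samples."""
    mc_var = pvariance(estimates)  # Monte Carlo variance of the point estimator
    bias = fmean(estimates) - truth
    mse = fmean([(e - truth) ** 2 for e in estimates])
    rel_bias_var = (fmean(variance_estimates) - mc_var) / mc_var
    return {"bias": bias, "mse": mse, "rel_bias_var": rel_bias_var}


# Toy input: two Monte Carlo replicates of a point estimate and its variance estimate.
summary = monte_carlo_summary([1.0, 3.0], [0.5, 1.0], truth=2.0)
```

A negative `rel_bias_var` indicates that the variance estimator underestimates the Monte Carlo variance on average, which is the pattern reported for the linearization estimator.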


6. Concluding remarks 


We have considered the problem of estimating $Y$ with
auxiliary information of the form $E\{U(X)\} = 0$ for some
known function $U(\cdot)$. The class of linear estimators of
the form $\hat{Y} = \sum_{i \in A} w_i y_i$ with $\sum_{i \in A} w_i \{1, U(x_i)\} = (N, 0)$
and $w_i > 0$ is considered. If the density $f(x; \eta)$ of $X$ is
known up to $\eta \in \Omega$, then efficient estimation can be
implemented using the estimated importance weight

$$w_i \propto d_i\, \frac{f(x_i; \eta_{0,N})}{f(x_i; \hat{\eta})},$$

where $d_i$ are the initial weights and where $\eta_{0,N}$ and $\hat{\eta}$ are
the maximum likelihood estimators of $\eta$ based on the
population and the sample, respectively. If the parametric
form of $f(x; \eta)$ is unknown, then the exponential tilting
weights of the form

$$w_i \propto d_i \exp\{\lambda' U(x_i)\}$$

can be used, where $\lambda$ is determined to satisfy

$$\sum_{i \in A} w_i U(x_i) = 0. \qquad (37)$$

Table 2
Monte Carlo biases and Monte Carlo mean squared errors of the point estimators for the mean of y, based on 10,000 Monte
Carlo samples

Population  Sample  Estimator                        SRS                  PPS
            size                                Bias      MSE        Bias      MSE
A           100     No calibration              0.00    0.02398      0.00    0.02023
                    Regression estimator        0.00    0.01261      0.00    0.01289
                    EL estimator (t = 1)        0.01    0.01369      0.01    0.01353
                    EL estimator (t = 10)       0.00    0.01285      0.00    0.01289
                    ET estimator (t = 1)        0.01    0.01334      0.01    0.01353
                    ET estimator (t = 10)       0.00    0.01269      0.00    0.01289
                    IVET estimator (t = 1)      0.01    0.01309      0.01    0.01330
                    IVET estimator (t = 10)     0.00    0.01263      0.00    0.01289
            200     No calibration              0.00    0.01069      0.00    0.00925
                    Regression estimator        0.00    0.00595      0.00    0.00568
                    EL estimator (t = 1)        0.01    0.00632      0.01    0.00604
                    EL estimator (t = 10)       0.00    0.00597      0.00    0.00568
                    ET estimator (t = 1)        0.00    0.00616      0.01    0.00578
                    ET estimator (t = 10)       0.00    0.00596      0.00    0.00568
                    IVET estimator (t = 1)      0.00    0.00605      0.01    0.00574
                    IVET estimator (t = 10)     0.00    0.00591      0.00    0.00567
B           100     No calibration              0.00    0.02044      0.00    0.01692
                    Regression estimator       -0.01    0.01473      0.00    0.01461
                    EL estimator (t = 1)        0.01    0.01652      0.01    0.01516
                    EL estimator (t = 10)       0.00    0.01490      0.01    0.01472
                    ET estimator (t = 1)        0.00    0.01516      0.01    0.01483
                    ET estimator (t = 10)       0.00    0.01470      0.00    0.01459
                    IVET estimator (t = 1)      0.00    0.01497      0.00    0.01458
                    IVET estimator (t = 10)     0.00    0.01472      0.00    0.01453
            200     No calibration              0.00    0.00888      0.00    0.00823
                    Regression estimator       -0.01    0.00705      0.00    0.00735
                    EL estimator (t = 1)        0.01    0.00769      0.01    0.00764
                    EL estimator (t = 10)       0.00    0.00715      0.01    0.00745
                    ET estimator (t = 1)        0.00    0.00723      0.01    0.00749
                    ET estimator (t = 10)       0.00    0.00706      0.01    0.00734
                    IVET estimator (t = 1)      0.00    0.00704      0.00    0.00728
                    IVET estimator (t = 10)     0.00    0.00699      0.00    0.00725

SRS, simple random sampling; PPS, probability proportional to size sampling; MSE, mean squared error; EL, empirical likelihood; ET,
exponential tilting; IVET, instrumental-variable exponential tilting.


Table 3
Monte Carlo relative biases of the variance estimators, based on 10,000 Monte Carlo samples

Population  Sample  Estimator          Linearization             Replication
            size                       SRS          PPS          SRS          PPS
A           100     ET (t = 1)        -7.02        -2.66        10.65         4.11
                    ET (t = 10)       -4.91        -0.80         5.60         0.67
                    IVET (t = 1)      -5.28        -3.63         7.67     [illegible]
                    IVET (t = 10)     -4.11        -0.87         4.96         0.41
            200     ET (t = 1)        -3.97        -0.19         3.65         0.57
                    ET (t = 10)    [illegible]      0.87         2.23        -0.35
                    IVET (t = 1)      -3.35        -0.10         2.34         0.02
                    IVET (t = 10)     -2.72         0.78         1.62        -0.53
B           100     ET (t = 1)        -7.64        -3.01     [illegible]      4.50
                    ET (t = 10)       -5.98        -0.98         7.21         0.74
                    IVET (t = 1)      -5.77        -2.31         4.53        -0.10
                    IVET (t = 10)     -5.44        -1.86     [illegible]     -0.51
            200     ET (t = 1)        -2.41        -1.01         5.76     [illegible]
                    ET (t = 10)       -1.29         0.18         4.30         1.91
                    IVET (t = 1)      -1.39        -0.35         2.09         1.04
                    IVET (t = 10)     -1.15        -0.06         2.04         0.99

SRS, simple random sampling; PPS, probability proportional to size sampling; ET, exponential tilting; IVET, instrumental-variable
exponential tilting.
If a solution to (37) exists, it can be expressed as the limit of
the form

$$w_{i(t)} \propto w_{i(0)} \prod_{s=0}^{t-1} \exp\big\{-\hat{U}_{(s)}' \hat{\Sigma}_{uu(s)}^{-1} U(x_i)\big\}, \qquad (38)$$

where $\hat{U}_{(s)} = \sum_{i \in A} w_{i(s)} U(x_i)$, $\hat{\Sigma}_{uu(s)} = \sum_{i \in A} w_{i(s)} \{U(x_i) - \bar{U}_{(s)}\}^{\otimes 2}$
and $\bar{U}_{(s)} = \sum_{i \in A} w_{i(s)} U(x_i) / \sum_{i \in A} w_{i(s)}$, with the initial weight
$w_{i(0)} = d_i (N/\hat{N}_d)$. If the solution to condition (37) does not
exist, we can still use the weights in (38), but the equality
must be relaxed. Instead, approximate equality will be
satisfied in (37) in the sense that $\sum_{i \in A} w_{i(t)} U(x_i)$ converges
to zero much faster than $\sum_{i \in A} w_{i(0)} U(x_i)$ for $t \ge 1$.
Approximate equality in (37) is called the approximate
calibration condition.

The estimators $\hat{Y}_{(t)} = \sum_{i \in A} w_{i(t)} y_i$ that use the $t$-step ET
weights in (38), including the one-step estimator $\hat{Y}_{(1)}$, are
asymptotically equivalent to the regression estimator of the
form

$$\hat{Y}_{reg} = \hat{Y}_{(0)} - \hat{U}_{(0)}' \hat{\Sigma}_{uu(0)}^{-1} \hat{\Sigma}_{uy(0)},$$

where $\hat{Y}_{(0)} = \sum_{i \in A} w_{i(0)} y_i$ and $\hat{\Sigma}_{uy(0)} = \sum_{i \in A} w_{i(0)} \{U(x_i) - \bar{U}_{(0)}\} y_i$.
Unlike the regression estimator, the weights of the
proposed method are always nonnegative. Furthermore,
using the instrumental variable technique in Section 3, the
weights are bounded above. A suitable choice of the
instrumental variable also improves the efficiency of the resulting
calibration estimator.

The exponential tilting calibration method is asymptotically
equivalent to the empirical likelihood calibration
method but it is more attractive computationally in the sense
that the partial derivatives are not required in the iterative
computation. Because the computation is simple, the
variance of the proposed estimator can be easily estimated
using a replication method, as discussed in Section 4.
Further investigation in this direction, including interval
estimation, can be a topic of future research.
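
The iterative form of the tilting weights can be sketched as follows. This is a minimal illustration under stated assumptions: a scalar auxiliary variable with $U(x_i) = x_i - \bar{X}_N$, weights normalised to sum to one, and each step multiplying the current weights by $\exp\{-\hat{U}_{(s)} \hat{\Sigma}_{uu(s)}^{-1} U(x_i)\}$. The function name and the numerical guards are choices made for this sketch, not part of the paper.

```python
import math


def t_step_et_weights(x, d, xbar_pop, steps):
    """Iterated exponential tilting weights for scalar x, normalised to sum to 1.

    Returns the final weights and the calibration error |sum_i w_i U(x_i)| after each step.
    """
    u = [xi - xbar_pop for xi in x]  # U(x_i) = x_i - Xbar_N
    total_d = sum(d)
    w = [di / total_d for di in d]   # normalised initial weights
    lam = 0.0                        # accumulated tilting parameter
    errors = []
    for _ in range(steps):
        u_hat = sum(wi * ui for wi, ui in zip(w, u))
        u_bar = u_hat / sum(w)
        sigma = sum(wi * (ui - u_bar) ** 2 for wi, ui in zip(w, u))
        if sigma <= 1e-12:
            break                    # weights have collapsed onto one point
        lam -= u_hat / sigma
        expo = [lam * ui for ui in u]
        shift = max(expo)            # subtract the maximum to avoid overflow
        raw = [di * math.exp(e - shift) for di, e in zip(d, expo)]
        total = sum(raw)
        w = [r / total for r in raw]
        errors.append(abs(sum(wi * ui for wi, ui in zip(w, u))))
    return w, errors


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w10, errors = t_step_et_weights(x, [1.0] * 5, 4.5, 10)
calibrated_mean = sum(wi * xi for wi, xi in zip(w10, x))
```

With $\bar{X}_N = 4.5$ inside the range of the sample, the calibration error shrinks rapidly from one step to the next and the weights remain positive throughout, in line with the approximate calibration condition described above.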
Appendix
A. Assumptions and proof of Theorem 1
We first assume the following regularity conditions:


[A-1] The density $f(x; \eta)$ is twice differentiable with
respect to $\eta$ for every $x$ and satisfies

$$\left| \frac{\partial^2 f(x; \eta)}{\partial \eta_i\, \partial \eta_j} \right| < K(x)$$

for a function $K(x)$ such that $E\{K(x)\} < \infty$, in a
neighborhood of $\eta_{0,N}$.

[A-2] The pseudo maximum likelihood estimator $\hat{\eta}$
satisfies $\sqrt{n}(\hat{\eta} - \eta_{0,N}) = O_p(1)$.

[A-3] The matrix $E\{s(\eta_{0,N})^{\otimes 2}\}$ exists and is nonsingular,
where $s(\eta_{0,N}) = \partial \ln f(x; \eta)/\partial \eta\,|_{\eta = \eta_{0,N}}$.


To prove Theorem 1, write

$$g_i(\eta) = \frac{f(x_i; \eta_{0,N})}{f(x_i; \eta)}$$

and $w_i(\eta) = d_i g_i(\eta)$. The estimated importance weight in
(8) can be written $w_i = w_i(\hat{\eta})$. Taking a Taylor expansion
of $N^{-1} \sum_{i \in A} d_i s_i(\hat{\eta}) = 0$ around $\eta_{0,N}$ leads to


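$$0 = \frac{1}{N} \sum_{i \in A} d_i s_i(\eta_{0,N}) + \Big\{\frac{1}{N} \sum_{i \in A} d_i \dot{s}_i(\eta_{0,N})\Big\} (\hat{\eta} - \eta_{0,N}) + o_p(|\hat{\eta} - \eta_{0,N}|).$$

Note that the first term on the right side of

$$\frac{1}{N} \sum_{i \in A} d_i \dot{s}_i(\eta) = \frac{1}{N} \sum_{i \in A} d_i\, \frac{\partial^2 f(x_i; \eta)/\partial \eta\, \partial \eta'}{f(x_i; \eta)} - \frac{1}{N} \sum_{i \in A} d_i \{s_i(\eta)\}^{\otimes 2} \qquad (A1)$$

converges to $\int \{\partial^2 f(x; \eta)/\partial \eta\, \partial \eta'\}\, dx$, which equals zero
by the dominated convergence theorem with [A-1]. The
second term converges to $-E\{s(\eta_{0,N})^{\otimes 2}\}$. Thus, by [A-2],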


Sued Sli On) (A2) 


Nica 
and 
(A3) 


i Noy = 2 Sod Op (n oo). 
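Now, taking a Taylor expansion of $N^{-1} \hat{Y}_w = N^{-1} \sum_{i \in A} w_i(\hat{\eta}) y_i$
around $\eta = \eta_{0,N}$ leads to

$$\frac{1}{N} \hat{Y}_w = \frac{1}{N} \sum_{i \in A} w_i(\eta_{0,N}) y_i + \Big[\frac{\partial}{\partial \eta'} \Big\{\frac{1}{N} \sum_{i \in A} w_i(\eta_{0,N}) y_i\Big\}\Big] (\hat{\eta} - \eta_{0,N}) + o_p(|\hat{\eta} - \eta_{0,N}|) \qquad (A4)$$

by the uniform continuity of $\partial \{\sum_{i \in A} w_i(\eta) y_i\}/\partial \eta$ around
$\eta_{0,N}$. Now, using

$$\frac{\partial}{\partial \eta} g_i(\eta) = -\frac{f(x_i; \eta_{0,N})}{f(x_i; \eta)} \cdot \frac{\partial f(x_i; \eta)/\partial \eta}{f(x_i; \eta)} = -g_i(\eta) \times s_i(\eta),$$

where $s_i(\eta) = \partial \ln f(x_i; \eta)/\partial \eta$, we have

$$\frac{\partial}{\partial \eta} \Big\{\frac{1}{N} \sum_{i \in A} w_i(\eta) y_i\Big\} = -\frac{1}{N} \sum_{i \in A} w_i(\eta) s_i(\eta) y_i.$$

Using $w_i(\eta_{0,N}) = d_i$ and writing $s_i(\eta_{0,N}) = s_{i0}$, we have,
by (A2),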
a1 
c 


Ss dSi9); 


l 
Te >; (Mho,w)d 7 sae 
ieA 


On N iecA 


= Oe): (AS) 


Using (A5) and (A3) in (A4), result (9) is obtained. 


B. Proof of Theorem 2 


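Write

$$\hat{\theta}(\lambda_1) = \frac{\sum_{i \in A} d_i m_i(\lambda_1) y_i}{\sum_{i \in A} d_i m_i(\lambda_1)},$$

where $m_i(\lambda_1) = \exp(\lambda_1' x_i)$. Note that $\hat{Y}_{ET(t)} = N \hat{\theta}(\hat{\lambda}_{1(t)})$
and $\hat{\lambda}_{1(t)}$ is defined in (19). By a Taylor expansion of
$\hat{\theta}(\hat{\lambda}_{1(t)})$ around $\lambda_1 = 0$ and by the continuity of
the partial derivatives of $\hat{\theta}(\lambda_1)$, we have

$$\hat{\theta}(\hat{\lambda}_{1(t)}) = \hat{\theta}(0) + \dot{\theta}(0)' (\hat{\lambda}_{1(t)} - 0) + o_p(|\hat{\lambda}_{1(t)} - 0|), \qquad (B1)$$

where $\dot{\theta}(\lambda) = \partial \hat{\theta}(\lambda)/\partial \lambda$. Because $\hat{\lambda}_{1(t)}$ converges in
quadratic order and the one-step estimator satisfies $\hat{\lambda}_{1(1)} = O_p(n^{-1/2})$,
equation (22) can be written as

$$\hat{\lambda}_{1(t)} = \Big\{\frac{1}{\hat{N}_d} \sum_{i \in A} d_i (x_i - \bar{X}_d)^{\otimes 2}\Big\}^{-1} (\bar{X}_N - \bar{X}_d) + o_p(n^{-1/2}). \qquad (B2)$$

Note that

$$\dot{\theta}(\lambda_1) = \Big\{\sum_{i \in A} d_i m_i(\lambda_1)\Big\}^{-1} \sum_{i \in A} d_i \dot{m}_i(\lambda_1) \{y_i - \hat{\theta}(\lambda_1)\},$$

where $\dot{m}_i(\lambda_1) = \partial m_i(\lambda_1)/\partial \lambda_1$. Using $m_i(0) = 1$ and
$\dot{m}_i(0) = x_i$, we have $\hat{\theta}(0) = \hat{Y}_d/\hat{N}_d$ and

$$\dot{\theta}(0) = \hat{N}_d^{-1} \sum_{i \in A} d_i (x_i - \bar{X}_d) y_i. \qquad (B3)$$

Therefore, inserting (B2) and (B3) into (B1), we have

$$\hat{\theta}(\hat{\lambda}_{1(t)}) = \frac{\hat{Y}_d}{\hat{N}_d} + (\bar{X}_N - \bar{X}_d)' \Big\{\frac{1}{\hat{N}_d} \sum_{i \in A} d_i (x_i - \bar{X}_d)^{\otimes 2}\Big\}^{-1} \frac{1}{\hat{N}_d} \sum_{i \in A} d_i (x_i - \bar{X}_d) y_i + o_p(n^{-1/2}),$$

which proves (23).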


References 


Beaumont, J.-F., and Bocci, C. (2008). Another look at ridge 
calibration. Metron, LXVI, 5-20. 


Breidt, F.J., Claeskens, G. and Opsomer, J.D. (2005). Model-assisted 
estimation for complex surveys using penalised splines.
Biometrika, 92, 831-846. 


Chambers, R.L. (1996). Robust case-weighting for multipurpose 
establishment surveys. Journal of Official Statistics, 12, 3-32. 


Chen, J., and Qin, J. (1993). Empirical likelihood estimation for finite 
populations and the effective usage of auxiliary information. 
Biometrika, 80, 107-116. 


Chen, J., and Sitter, R.R. (1999). A pseudo empirical likelihood 
approach to the effective use of auxiliary information in complex 
surveys. Statistica Sinica, 9, 385-406. 


Chen, J., Variyath, A.M. and Abraham, B. (2008). Adjusted empirical 
likelihood and its properties. Journal of Computational and 
Graphical Statistics, 17, 426-443. 


Deville, J.-C., and Särndal, C.-E. (1992). Calibration estimators in
survey sampling. Journal of the American Statistical Association, 
87, 376-382. 


Deville, J.-C., Särndal, C.-E. and Sautory, O. (1993). Generalized
raking procedure in survey sampling. Journal of the American
Statistical Association, 88, 1013-1020. 


Efron, B. (1981). Nonparametric standard errors and confidence 
intervals. Canadian Journal of Statistics, 9, 139-172. 


Estevao, V.M., and Särndal, C.-E. (2000). A functional approach to
calibration. Journal of Official Statistics, 16, 379-399. 


Folsom, R.E. (1991). Exponential and logistic weight adjustment for 
sampling and nonresponse error reduction. In Proceedings of the 
Section on Social Statistics, American Statistical Association, 197- 
202. 


Fuller, W.A. (2002). Regression estimation for sample surveys. 
Survey Methodology, 28, 5-23. 


Fuller, W.A. (2009). Sampling Statistics. Hoboken, New Jersey: John 
Wiley & Sons, Inc. 


Givens, G.H., and Hoeting, J.A. (2005). Computational Statistics. 
Hoboken, New Jersey: John Wiley & Sons, Inc. 


Henmi, M., Yoshida, R. and Eguchi, S. (2007). Importance sampling 
via the estimated sampler. Biometrika, 94, 985-991. 


Imbens, G.W. (2002). Generalized method of moments and empirical 
likelihood. Journal of Business and Economic Statistics, 20,
493-506. 


Isaki, C., and Fuller, W.A. (1982). Survey design under the regression 
superpopulation model. Journal of the American Statistical 
Association, 77, 89-96. 


Kim, J.K. (2009). Calibration estimation using empirical likelihood in 
survey sampling. Statistica Sinica, 19, 145-157. 


Kim, J.K., and Park, M. (2010). Calibration estimation in survey 
sampling. International Statistical Review, In press.


Kott, P.S. (2003). A practical use for instrumental-variable 
calibration. Journal of Official Statistics, 19, 265-272. 


Kott, P.S. (2006). Using calibration weighting to adjust for 
nonresponse and coverage errors. Survey Methodology, 32, 133- 
142. 


Kitamura, Y., and Stutzer, M. (1997). An information-theoretic 
alternative to generalized method of moments estimation. 
Econometrica, 65, 861-874. 


Park, M., and Fuller, W.A. (2009). The mixed model for survey 
regression estimation. Journal of Statistical Planning and 
Inference, 139, 1320-1331. 


Rao, J.N.K., and Singh, A. (1997). A ridge shrinkage method for 
range restricted weight calibration in survey sampling. In 
Proceedings of the Section on Survey Research Methods, 
American Statistical Association, 57-64.


Rust, K.F., and Rao, J.N.K. (1996). Variance estimation for complex 
surveys using replication techniques. Statistical Methods in 
Medical Research, 5, 283-310. 


Särndal, C.-E. (2007). The calibration approach in survey theory and
practice. Survey Methodology, 33, 99-119. 


Särndal, C.-E., Swenson, B. and Wretman, J.H. (1992). Model
Assisted Survey Sampling. New York: Springer. 


Tillé, Y. (1998). Estimation in surveys using conditional probabilities: 
Simple random sampling. International Statistical Review, 66,
303-322. 


Wolter, K.M. (2007). Introduction to Variance Estimation. 2nd Ed.
New York: Springer-Verlag. 


Wu, C., and Rao, J.N.K. (2006). Pseudo empirical likelihood ratio 
confidence intervals for complex surveys. Canadian Journal of 
Statistics, 34, 359-375. 




Survey Methodology, December 2010 
Vol. 36, No. 2, pp. 157-170 
Statistics Canada, Catalogue No. 12-001-X 


Comparison of survey regression techniques in 
the context of small area estimation of poverty 


Stephen J. Haslett, Marissa C. Isidro and Geoffrey Jones ' 


Abstract 


One key to poverty alleviation or eradication in the third world is reliable information on the poor and their location, so that 
interventions and assistance can be effectively targeted to the neediest people. Small area estimation is one statistical 
technique that is used to monitor poverty and to decide on aid allocation in pursuit of the Millennium Development Goals. 
Elbers, Lanjouw and Lanjouw (ELL) (2003) proposed a small area estimation methodology for income-based or 
expenditure-based poverty measures, which is implemented by the World Bank in its poverty mapping projects via the 
involvement of the central statistical agencies in many third world countries, including Cambodia, Lao PDR, the 
Philippines, Thailand and Vietnam, and is incorporated into the World Bank software program PovMap. In this paper, the 
ELL methodology which consists of first modeling survey data and then applying that model to census information is 
presented and discussed with strong emphasis on the first phase, i.e., the fitting of regression models and on the estimated
standard errors at the second phase. Other regression model fitting procedures such as the General Survey Regression (GSR) 
(as described in Lohr (1999) Chapter 11) and those used in existing small area estimation techniques: Pseudo-Empirical 
Best Linear Unbiased Prediction (Pseudo-EBLUP) approach (You and Rao 2002) and Iterative Weighted Estimating 
Equation (IWEE) method (You, Rao and Kovačević 2003) are presented and compared with the ELL modeling strategy.
The most significant difference between the ELL method and the other techniques is in the theoretical underpinning of the 
ELL model fitting procedure. An example based on the Philippines Family Income and Expenditure Survey is presented to 
show the differences in both the parameter estimates and their corresponding standard errors, and in the variance 
components generated from the different methods and the discussion is extended to the effect of these on the estimated 
accuracy of the final small area estimates themselves. The need for sound estimation of variance components, as well as
regression estimates and estimates of their standard errors for small area estimation of poverty is emphasized.


Key Words: Small area models; Nested error regression model; Poverty mapping. 


1. Introduction 


Poverty is a very complex multidimensional concern: 
there is no single definition and method of measurement 
available. In this paper, we adhere to the meaning of poverty 
that is used by most economists, i.e., households are consid- 
ered to be in poverty if their income falls below some 
income threshold called the poverty line. Chambers (2006) 
described this as income-poverty, and it is the definition 
adopted by the World Bank in the implementation of their 
small area poverty mapping projects carried out in conjunc- 
tion with national statistical agencies and used, for example, 
for monitoring progress towards the Millennium Develop- 
ment Goals (UN website). Sometimes expenditure-based 
poverty measures are used instead to assess economic 
poverty. In public health related contexts, different measures 
such as standardized weight for age, height for age and 
weight for height for children (underweight, stunting and 
wasting, respectively) are used, e.g., in Bangladesh (Haslett 
and Jones 2004) and Nepal (Haslett and Jones 2006). 

Surveys conducted in most third world countries usually 
allow an acceptable level of precision for reporting poverty 
statistics at the first and second administrative level or 
geographical area (e.g., for the Philippines - National and 


Region respectively). However, for policy makers to properly
target assistance and interventions to the neediest
communities and households, more disaggregated, finer-level
poverty statistics are needed. Survey-based
poverty statistics at smaller geographical areas or lower
administrative levels are usually less reliable (have higher
standard errors) due to smaller sample sizes, and this is
where small area estimation comes into play.

The most common small area estimation methodology 
used for poverty measures in third world countries proposed 
by Elbers, Lanjouw and Lanjouw (ELL) (2002, 2003) 
allows generation of more precise estimates for smaller 
geographical areas by combining the survey data with 
information from a recent census. The ELL method consists 
of two phases: fitting a regression model (or models) to 
complex survey data and using that model to predict income 
or expenditure per capita at household level (which is 
transformed and aggregated to estimate poverty statistics at 
small area level). 

In this paper, we focus specifically on the various 
algorithms used to fit the phase 1 regression models, and to
estimate regression parameter standard errors and variance 
components from survey data. We emphasise consequences 
of survey regression modeling decisions rather than the 
entire and rather comprehensive system ELL use to form
small area estimates. 

The preliminary requirement of the ELL methodology 
applied to economic measures is to develop an accurate 
model of per capita income or expenditure of households 
although this is often used to generate non-linear functions 
of income or expenditure (e.g., poverty incidence - percent- 
age of households below the poverty line, or poverty gap - 
sum of relative differences in income or expenditure for 
households or individuals below the poverty line). The 
survey-based regression model developed for income or 
expenditure is critical to accurate poverty statistics, but as 
we show below the regression model itself is not always the 
most important element, and other issues such as estimation 
of variance components deserve emphasis. 

Other existing survey-based small area estimation regres- 
sion techniques - Pseudo-Empirical Best Linear Unbiased 
Prediction (Pseudo-EBLUP) approach (You and Rao 2002), 
Iterative Weighted Estimating Equation (IWEE) method 
(You et al. 2003) and the General Survey Regression (GSR) 
(Skinner, Holt and Smith 1989) method are considered as 
alternative survey based model-fitting techniques and 
compared with two variations of the ELL method for fitting 
regression models to survey data. Our investigation is based 
on real data from the 2000 Philippine Family Income and 
Expenditure Survey (FIES), rather than simulated data. 

This paper is organized as follows: Section 2 gives 
relevant background on small area models; the model for 
income (or expenditure) as presented by Elbers, Lanjouw 
and Lanjouw is given in Section 3; presented in Section 4 is 
a summary of the ELL methodology, followed by details on 
the alternative fitting methods in Section 5, which includes 
the Pseudo-Empirical Best Linear Unbiased Prediction Ap- 
proach (5.1), IWEE Method (5.2), and the General Survey 
Regression Method (5.3). Section 6 discusses differences 
between the techniques, while Section 7 presents their appli- 
cation to the Philippine FIES 2000 data. This is followed by 
the conclusion and recommendations (Section 8). 


2. Small area models 


Ghosh and Rao (1994) classify small area models into 
two broad categories, area level and unit level models. Area 
level models refer to sets of models that can be considered 
when only area-specific auxiliary variables are available. 
Unit level models, on the other hand, refer to models that 
can be considered when there are unit-specific auxiliary 
variables and unit level values of the variable under study 
can be used. All such models are special cases of a general 
linear or generalized linear mixed model, and usually in- 
volve both fixed and random effects. 
For area level models, it is assumed that the population
mean $\bar{Y}_a$ of the $a$th small area or some suitable function
$\theta_a = g(\bar{Y}_a)$ is related to the area-specific auxiliary variables
$x_a = (x_{a1}, \ldots, x_{ap})'$ through a linear model

$$\theta_a = x_a' \beta + c_a v_a, \qquad (1)$$

where $a = 1, \ldots, k$, $v_a \sim iid(0, \sigma_v^2)$, $\beta$ is a vector of
regression parameters, $c_a$ are known or estimated positive
constants to allow for heteroscedasticity, $k$ is the total number
of small areas under study and $p$ is the number of auxiliary
variables. It is assumed that a direct design-based estimator,
$\hat{\bar{Y}}_a$, of the population mean $\bar{Y}_a$ is available whenever the
area sample size $n_a \ge 1$, and that

$$\hat{\theta}_a = \theta_a + e_a, \qquad (2)$$

where $\hat{\theta}_a = g(\hat{\bar{Y}}_a)$ and the sampling errors $e_a$ are
independent $(0, V_a)$ with known variance $V_a$. Combining
equations (1) and (2) gives the area level linear mixed model:

$$\hat{\theta}_a = x_a' \beta + c_a v_a + e_a. \qquad (3)$$

We note that (3) involves both design-based random
variables $e_a$ and model-based random variables $v_a$ (Rao
1999), where design-based variables are due to the sample
selection mechanism, and model-based ones to the
superpopulation structure in which the model is embedded.

Area level models have various extensions so they can,
for example, handle correlated sampling errors, spatial
dependence of random small area effects, time series and
cross-sectional data (see Rao 2003, 1999 and Ghosh and
Rao 1994).

The unit level model assumes that the variable of interest 
$Y_{ah}$ for the $h$-th unit in the $a$-th small area is related to the 
element-specific auxiliary data $x_{ah} = (x_{ah1}, \ldots, x_{ahp})'$ through 
a nested error regression model: 


$$Y_{ah} = x_{ah}'\beta + v_a + e_{ah}, \qquad (4)$$


where $a = 1, \ldots, k$, $h = 1, \ldots, N_a$, $\beta = (\beta_1, \ldots, \beta_p)'$ is a $p \times 1$ 
vector of regression parameters and $N_a$ is the number of 
population units or households in the $a$-th small area. It is 
also assumed that the random effects $v_a$ are iid $N(0, \sigma_v^2)$ 
and are independent of the unit errors $e_{ah}$, which are 
assumed to be iid $N(0, \sigma_e^2)$. Extensions that allow errors to 
be heteroscedastic, with known scaling constant(s), are also 
possible. 

The ELL method uses a unit level model, where the units 
are households in the case of income or expenditure data, 
and where the variation is modeled at primary sampling 
unit, i.e., cluster level and household level. Note that ELL 
do not include model variation at small area level, only for 
cluster within small area, and for household within cluster. 
This is the form of the basic model used for comparisons in 
this paper since ELL is the standard small area estimation 
method for poverty in third world countries. In the real 
datasets we have studied, this additional small area variation 
has been very small. Despite this empirical evidence, however, 
important questions remain about how best to estimate 
the small area variance component in the presence of cluster 
level variation, when there is sample survey weighting, 
especially where many of the small areas contain only one 
sampled cluster. 

The ELL model has a number of other characteristics not 
all of which are standard in a statistical sense (see Haslett 
and Jones 2005, for example). The intention of this paper is 
not to discuss differences in the available methods gener- 
ally, but to focus directly on how methods of fitting 
regression models to survey data differ when the ELL first 
phase “base structure” of fitting a survey regression model 
is used. The focus of this paper therefore is on comparison 
of the available methods of fitting regression models to 
survey data on income or expenditure using a specified set 
of regressors, even though ELL can also be (and is) used 
relatively routinely to find small area estimates for non- 
linear functions (e.g., poverty incidence, gap or severity) by 
applying fitted regression models to a census. 

The answer to the ‘best regression model fitting’ question 
for survey data on which this paper focuses (as with other 
matters related to the ELL methodology) is particularly 
important because there are billions of dollars of aid funding 
that are (or have the potential to be) allocated based on the 
regression models used as part of small area estimation of 
poverty. 


3. Income/consumption model 


Modeling per capita income or expenditure of house- 
holds instead of poverty measures themselves (such as 
poverty incidence and gap) is one of the distinctive features 
of the ELL method. As mentioned in the previous section, 
the ELL method involves fitting the income or expenditure 
model to the survey data and applying it to the census data 
prior to the generation of the small area estimates of poverty 
measures. The income/expenditure model is as follows: 


$$Y_{bh} = x_{bh}'\beta + u_{bh}, \qquad (5)$$


where $b = 1, \ldots, M$, $h = 1, \ldots, N_b$; $Y_{bh}$ is the log-transformed 
per capita income or expenditure of the $h$-th unit or 
household in the $b$-th cluster, $M$ is the total number of 
clusters in the population and $N_b$ is the total number of 
households in the $b$-th cluster in the population. $x_{bh}$ is a set 
of the auxiliary variables available in both the survey and 
the census, which generally need to be contemporaneous; 
$u_{bh}$ is the random error term representing that part of $Y_{bh}$ 
that cannot be explained by $x_{bh}$. Income and expenditure 
data almost invariably have a skewed distribution, hence a 
transformation (usually logarithmic) is applied to make the 
data more symmetrical. 

The households for which data on per capita income or 
expenditure is collected are seldom independent, but have 
natural groupings or clusters, often defined administratively. 
Households that are close to each other or in the same 
cluster, tend to be similar in many respects. In the survey 
data, the clusters are usually also the primary sampling units 
(PSUs) for the sample survey design. To account for the 
clustering of households, the random error term $u_{bh}$ in the 
regression model is usually assumed to have the following 
specification: 


$$u_{bh} = v_b + e_{bh}, \qquad (6)$$


where $v$ and $e$ are independent of each other and 
uncorrelated with $x_{bh}$, $v_b$ is the error term held in common 
by the $b$-th group or cluster (e.g., barangay for the 
Philippines) and $e_{bh}$ is the household level error within the 
cluster. The importance of each term is measured by their 
respective variances or variance components, $\sigma_v^2$ and $\sigma_e^2$. 
There are various procedures for estimating these variances. 
This important topic is covered in the sections that follow. 
Model (5) can be written as 


$$Y_{bh} = x_{bh}'\beta + v_b + e_{bh}, \qquad (7)$$


which is similar in form to the unit level model or nested 
error regression model mentioned in the previous section. 
However, while the form of the model is similar, the group 
being referred to is different, e.g., $Y_{ah}$ refers to the $h$-th 
household in the $a$-th small area, while $Y_{bh}$ refers to the $h$-th 
household in the $b$-th cluster. Clusters, based on the survey 
design, will typically be much smaller than the areas for 
which small area estimates are sought, and generally (unlike 
almost all the small areas) not all clusters are sampled. For 
example in the Philippines, estimates are sought at the 
municipal level which is composed of barangays or clusters. 


4. The ELL methodology 


In the ELL methodology, the estimate of the regression 
parameter $\beta$ is given in Elbers et al. (2002, page 11, 
footnote 8), and in the POVMAP software (Zhao 2006) 
developed for the ELL method, as 


$$\hat{\beta}_{ELL} = \left(\sum_{b=1}^{m} X_b' W_b V_b^{-1} X_b\right)^{-1}\left[\sum_{b=1}^{m} X_b' W_b V_b^{-1} y_b\right] \qquad (8)$$


and the corresponding variance-covariance matrix as 


$$V(\hat{\beta}_{ELL}) = D\left(\sum_{b=1}^{m} X_b' W_b V_b^{-1} W_b X_b\right) D \qquad (9)$$


where $V_b = (\sigma_e^2 I_{n_b} + \sigma_v^2 1_{n_b} 1_{n_b}')$, $(\sigma_v^2)$ is the cluster level 
variance, while $(\sigma_e^2)$ is the household level variance, $I_{n_b}$ is 
an identity matrix, $1_{n_b} = (1 \ldots 1)'$ is a constant vector, 
$D = (\sum_{b=1}^{m} X_b' W_b V_b^{-1} X_b)^{-1}$; $X_b = (x_{b1}, \ldots, x_{bn_b})'$; 
$y_b = (y_{b1}, \ldots, y_{bn_b})'$; 
$W_b$ is a diagonal matrix of sampling weights; $m$ is the 
number of clusters in the sample and $n_b$ is the number of 
households in each sampled cluster. Equation (8) assumes 
$V_b$ is known. In practice we need to estimate $\sigma_v^2$ and $\sigma_e^2$ to 
get the estimator $\hat{V}_b$. We note that the variance expression 
in (9) is derived under a vaguely specified model assumed 
for the sample (see Elbers et al. 2002). Under the ELL 
method, fitting the income/expenditure model (7) involves 
obtaining the initial estimate of $\beta$ through the weighted least 
squares (WLS) method and using the residuals of the initial 
model to estimate the covariance matrix $V_b$ needed to 
obtain $\hat{\beta}_{ELL}$. The estimates of the cluster level ($\sigma_v^2$) and 
household level ($\sigma_e^2$) variances are derived by Elbers et al. 
(2002) as follows: 


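$$\hat{\sigma}_v^2 = \max\left\{\frac{\sum_b w_b (u_{b.} - u_{..})^2}{\sum_b w_b (1 - w_b)} - \frac{\sum_b w_b (1 - w_b)\hat{\tau}_b^2}{\sum_b w_b (1 - w_b)},\; 0\right\} \qquad (10)$$


where $\hat{\tau}_b^2 = \sum_h (e_{bh} - e_{b.})^2/(n_b(n_b - 1))$; $w_b = \sum_h w_{bh}/\sum_b \sum_h w_{bh}$ 
are the by-cluster transformed sampling weights, which sum 
to one across clusters, and $w_{bh}$ are the re-scaled sampling 
weights, which sum to the total sample size. Here $u_{b.} = 
\sum_h u_{bh}$ and $u_{..} = \sum_b \sum_h u_{bh}$ (which is equal to zero), where 
$u_{bh}$ is as defined in equation (6). 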

There are two ways suggested by Elbers et al. (2002) to 
generate the estimate of the household level variance 
component: “direct” computation, which is denoted by $(\hat{\sigma}_e^2)$, or 
the heteroscedasticity model-based $(\hat{\sigma}_{e,bh}^2)$. Direct 
computation involves using the difference between the estimated 
mean square error from the initial WLS regression and the 
computed estimate of $\sigma_v^2$, while the heteroscedasticity 
model-based computation uses a logistic-type link function 
to bound the variance as follows: 


$$\sigma_{e,bh}^2 = \sigma^2(z_{bh}, \alpha, A, B) = \left[\frac{A \exp(z_{bh}'\alpha) + B}{1 + \exp(z_{bh}'\alpha)}\right] \qquad (11)$$


where $A$ and $B$ are the upper and lower bounds respectively, 
estimated with the parameter vector $\alpha$ using a standard 
pseudomaximum likelihood procedure (Elbers et al. 2003), 
and where $z_{bh}$ are auxiliary variables. Elbers et al. claim 
that imposing a minimum bound of zero and a maximum 
bound of $A^* = (1.05)\max\{e_{bh}^2\}$ in general yields similar 
estimates of the parameters $\alpha$. These restrictions allow one 
to estimate the simpler form 


$$\ln\left[\frac{e_{bh}^2}{A^* - e_{bh}^2}\right] = z_{bh}'\alpha + r_{bh} \qquad (12)$$


where $r_{bh}$ is an error term and the other variables are as 
defined earlier. In most of the World Bank poverty mapping 
projects, slight modifications are usually made, for example, 
adding a constant $\delta$ to $e_{bh}^2$ in model (11). 

By using model (12), and employing the delta method, 
$\hat{\sigma}_{e,bh}^2$ is computed as: 


$$\hat{\sigma}_{e,bh}^2 = \left[\frac{A C_{bh}}{1 + C_{bh}}\right] + \frac{1}{2}\hat{\sigma}_r^2\left[\frac{A C_{bh}(1 - C_{bh})}{(1 + C_{bh})^3}\right] \qquad (13)$$


where $C_{bh} = \exp\{z_{bh}'\hat{\alpha}\}$, and $\hat{\sigma}_r^2$ is the estimated variance 
of the residuals under model (12). If the household level 
variance component is based on a heteroscedastic model, 
then $V_b = (\sigma_{e,bh}^2 I_{n_b} + \sigma_v^2 1_{n_b} 1_{n_b}')$. Heteroscedasticity 
modeling is conducted on the assumption that variation at the 
household level depends on some covariates. 
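
As a rough illustration of (11) and (13), the following Python sketch evaluates the bounded variance function and its delta-method approximation. The function names and all numerical inputs are hypothetical and chosen only for illustration; they are not taken from ELL or from POVMAP.

```python
import math

def bounded_variance(z_alpha, A, B=0.0):
    """Logistic-type link of (11): maps the linear predictor z'alpha
    to a variance lying between the bounds B and A."""
    c = math.exp(z_alpha)
    return (A * c + B) / (1.0 + c)

def delta_method_variance(z_alpha_hat, A, var_r):
    """Approximation (13), with C = exp(z'alpha_hat) and var_r the
    estimated residual variance from the regression in (12)."""
    c = math.exp(z_alpha_hat)
    first = A * c / (1.0 + c)
    second = 0.5 * var_r * A * c * (1.0 - c) / (1.0 + c) ** 3
    return first + second

# Hypothetical inputs: largest squared residual 0.8, lower bound zero.
A_star = 1.05 * 0.8
for lp in (-2.0, 0.0, 2.0):
    print(lp, bounded_variance(lp, A_star),
          delta_method_variance(lp, A_star, var_r=4.0))
```

At a linear predictor of zero the second-order term in (13) vanishes, so the two functions agree there; for positive predictors the correction is negative.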

As discussed in more detail in the appendix, the way in 
which the weight matrix $W_b$ enters the calculation in 
equation (9) above leads to an asymmetric estimated covariance 
matrix. A better approach based on ‘pseudomaximum 
likelihood’ is outlined by Pfeffermann, Skinner, Holmes, 
Goldstein and Rasbash (1998) and involves splitting 
$X_b' V_b^{-1} X_b$ into separate sums of squares and cross-product 
terms, and weighting each appropriately: if we write 
$V_b^{-1} = c I_{n_b} + d 1_{n_b} 1_{n_b}'$, then the appropriate weighting is 
$c X_b' W_b X_b + d X_b' W_b 1_{n_b} 1_{n_b}' W_b X_b$. 

Since the ELL version, $W_b V_b^{-1}$, is not generally symmetric, 
neither is $D$ in equation (9). As a consequence the 
supposed covariance matrix of $\hat{\beta}_{ELL}$, $V(\hat{\beta}_{ELL})$, is also not 
symmetric. The POVMAP software attempts to solve this 
problem by taking the average of $V(\hat{\beta}_{ELL})$ and its 
transpose, thereby forcing the matrix to be symmetric. 
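
The asymmetry can be checked numerically. The sketch below uses a single hypothetical cluster of three households with unequal weights; the data, the variance components and the variable names are all illustrative assumptions, and NumPy is used only for the matrix algebra.

```python
import numpy as np

# Hypothetical cluster: n_b = 3 households, intercept plus one regressor.
sigma2_e, sigma2_v = 1.0, 0.5
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
w_b = np.array([0.5, 1.0, 1.5])      # unequal weights, scaled to sum to n_b
n_b = len(w_b)
W_b = np.diag(w_b)
one = np.ones((n_b, 1))
V_b = sigma2_e * np.eye(n_b) + sigma2_v * (one @ one.T)
V_inv = np.linalg.inv(V_b)

# ELL-type quantities for this one cluster, following (8) and (9).
D = np.linalg.inv(X_b.T @ W_b @ V_inv @ X_b)
V_ell = D @ (X_b.T @ W_b @ V_inv @ W_b @ X_b) @ D

# V_b^{-1} = c I + d 1 1' for this compound-symmetric V_b.
c = 1.0 / sigma2_e
d = -sigma2_v / (sigma2_e * (sigma2_e + n_b * sigma2_v))
pml_term = c * (X_b.T @ W_b @ X_b) + d * (X_b.T @ W_b @ one @ one.T @ W_b @ X_b)

# Averaging with the transpose, as described for POVMAP.
V_sym = 0.5 * (V_ell + V_ell.T)

print("V_ell symmetric?", np.allclose(V_ell, V_ell.T))
print("pseudomaximum likelihood term symmetric?", np.allclose(pml_term, pml_term.T))
```

With equal weights the ELL-type term would be symmetric; the asymmetry arises only because $W_b$ and $V_b^{-1}$ do not commute when the weights differ within a cluster.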

Note again that under the ELL method, the regression fit 
to the survey data and the estimation of variance 
components is only the first phase. The subsequent phase involves 
prediction at household level based on the entire census data 
and aggregation to small area level. 

The survey fitting methods (derivation of the estimate of 
B and its corresponding variance-covariance matrix) of 
three alternative regression procedures to ELL are presented 
in the following sections. 


5. Alternative fitting methods 


5.1 The pseudo-empirical best linear unbiased 
prediction approach 


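You and Rao (2002) proposed an estimator of the small 
area mean by deriving an estimator of $\beta$ based on the unit 
level model (4). The process of deriving the estimator of $\beta$ 
starts with the computation of the best linear unbiased 
predictor (BLUP) of $v_a$ given the parameters $\beta$, $\sigma_e^2$ and 
$\sigma_v^2$ from the aggregated (survey-weighted) area level 
model: 


$$\bar{y}_{aw} = \bar{x}_{aw}'\beta + v_a + \bar{e}_{aw}, \qquad (14)$$


which proceeds as follows: 


$$\tilde{v}_{aw}(\beta, \sigma_e^2, \sigma_v^2) = \gamma_{aw}(\bar{y}_{aw} - \bar{x}_{aw}'\beta) \qquad (15)$$


where $\bar{x}_{aw} = \sum_{h=1}^{n_a} w_{ah} x_{ah}$, $\bar{y}_{aw} = \sum_{h=1}^{n_a} w_{ah} y_{ah}$, 
$\gamma_{aw} = \sigma_v^2/(\sigma_v^2 + \sigma_e^2 \delta_a^2)$, $w_{ah} = \tilde{w}_{ah}/\sum_{h=1}^{n_a} \tilde{w}_{ah}$, 
$\delta_a^2 = \sum_{h=1}^{n_a} w_{ah}^2$, and $\tilde{w}_{ah}$ are the 
unit level survey weights; then solving the survey-weighted 
estimating equation for $\beta$: 


$$\sum_{a=1}^{k}\sum_{h=1}^{n_a} \tilde{w}_{ah} x_{ah}\left[y_{ah} - x_{ah}'\beta - \tilde{v}_{aw}(\beta, \sigma_e^2, \sigma_v^2)\right] = 0 \qquad (16)$$


from which the estimator of $\beta$ is obtained as 


$$\hat{\beta}_w = \left[\sum_{a=1}^{k}\sum_{h=1}^{n_a} x_{ah} z_{ah}'\right]^{-1}\left[\sum_{a=1}^{k}\sum_{h=1}^{n_a} z_{ah} y_{ah}\right] \qquad (17)$$


where $z_{ah} = \tilde{w}_{ah}(x_{ah} - \gamma_{aw}\bar{x}_{aw})$. The corresponding 
covariance matrix is then as follows: 


$$V(\hat{\beta}_w) = \sigma_e^2\, G^{-1}\left[\sum_{a=1}^{k}\sum_{h=1}^{n_a} z_{ah} z_{ah}'\right](G^{-1})' + \sigma_v^2\, G^{-1}\left[\sum_{a=1}^{k}\left(\sum_{h=1}^{n_a} z_{ah}\right)\left(\sum_{h=1}^{n_a} z_{ah}\right)'\right](G^{-1})', \quad G = \sum_{a=1}^{k}\sum_{h=1}^{n_a} x_{ah} z_{ah}' \qquad (18)$$


The variance components are estimated using Henderson’s 
Method 3 (Henderson 1953), to generate unbiased estimates 
even in the presence of correlated elements in the model. 
The estimators of the variance components are as follows: 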


$$\hat{\sigma}_e^2 = (n - k - p + 1)^{-1}\sum_{a=1}^{k}\sum_{h=1}^{n_a}\hat{\varepsilon}_{ah}^2 \qquad (19)$$


where $\{\hat{\varepsilon}_{ah}\}$ are residuals from the OLS regression of 
$(y_{ah} - \bar{y}_{a.})$ on $\{x_{ah1} - \bar{x}_{a.1}, \ldots, x_{ahp} - \bar{x}_{a.p}\}$, and 
$(\bar{y}_{a.}, \bar{x}_{a.1}, \ldots, \bar{x}_{a.p})$ are the sample means in the $a$-th group. 


$$\hat{\sigma}_v^2 = n_*^{-1}\left[\sum_{a=1}^{k}\sum_{h=1}^{n_a}\hat{u}_{ah}^2 - (n - p)\hat{\sigma}_e^2\right] \qquad (20)$$


where $n_* = n - \mathrm{tr}[(X'X)^{-1}\sum_{a=1}^{k} n_a^2 \bar{x}_{a.}\bar{x}_{a.}']$ with $X = (X_1', \ldots, X_k')'$, 
and the $\{\hat{u}_{ah}\}$ are the residuals from the OLS regression 
of $y_{ah}$ on $\{x_{ah1}, \ldots, x_{ahp}\}$. For the model (7), the 
subscript $a$ is replaced by $b$. 

However, the Henderson estimators above do not 
account for the sampling weights. To address this, an 
estimation technique has been proposed by You et al. (2003) 
which extends the Pseudo-EBLUP method by incorporating 
the weights in the estimation of the variance components. 
This is described in the next section. 
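
The estimator in (17) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pseudo_eblup_beta`, the toy data and the variance components passed in are hypothetical, the weights are normalized within areas to form the weighted means and $\delta_a^2$, and the variance components are treated as known.

```python
import numpy as np

def pseudo_eblup_beta(X, y, w, area, sigma2_e, sigma2_v):
    """Survey-weighted estimator of beta in the form of (17).
    X: (n, p) regressors, y: (n,) response, w: (n,) survey weights,
    area: (n,) integer area labels."""
    p = X.shape[1]
    G = np.zeros((p, p))
    rhs = np.zeros(p)
    for a in np.unique(area):
        idx = area == a
        Xa, ya, wa = X[idx], y[idx], w[idx]
        w_norm = wa / wa.sum()                   # weights normalized within area
        delta2 = np.sum(w_norm ** 2)
        gamma = sigma2_v / (sigma2_v + sigma2_e * delta2)
        xbar = w_norm @ Xa                       # survey-weighted area mean of x
        Z = wa[:, None] * (Xa - gamma * xbar)    # rows are z_ah'
        G += Xa.T @ Z                            # sum over h of x_ah z_ah'
        rhs += Z.T @ ya                          # sum over h of z_ah y_ah
    return np.linalg.solve(G, rhs)

# Hypothetical toy data: three areas, intercept plus one covariate.
X = np.array([[1, 0.5], [1, 1.5], [1, 2.0], [1, 0.0],
              [1, 3.0], [1, 1.0], [1, 2.5], [1, 4.0]], dtype=float)
area = np.array([0, 0, 0, 1, 1, 2, 2, 2])
w = np.array([10.0, 20.0, 30.0, 15.0, 15.0, 5.0, 25.0, 10.0])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true                                # noise-free response
beta_hat = pseudo_eblup_beta(X, y, w, area, sigma2_e=0.2, sigma2_v=0.05)
print(beta_hat)
```

Two properties serve as sanity checks on the sketch: noise-free data return the generating coefficients, and setting the area-level variance to zero makes every $\gamma_{aw}$ zero, which reduces the estimator to ordinary survey-weighted least squares.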


5.2 The iterative weighted estimating equation 
method 


The estimator proposed by You et al. (2003) is similar to 
the Pseudo-EBLUP estimator, except that it incorporates the 
sampling weights in the computation of the variance 
components, and it generates the parameter estimate $\hat{\beta}$ and the 
variance components by using an iterative weighted 
estimating equation (IWEE) approach. The authors derived the 
estimators of $\sigma_e^2$ and $\sigma_v^2$ as follows: 


$$\hat{\sigma}_e^{2(t)} = \frac{\sum_{a=1}^{k}\sum_{h=1}^{n_a}\tilde{w}_{ah}\left[(y_{ah} - \bar{y}_{aw}) - (x_{ah} - \bar{x}_{aw})'\hat{\beta}^{(t-1)}\right]^2}{\sum_{a=1}^{k}\left[(1 - \delta_a^2)\sum_{h=1}^{n_a}\tilde{w}_{ah}\right]} = \tilde{\sigma}_e^2(\hat{\beta}^{(t-1)}) \qquad (21)$$
and 
1 k ~2(t-l) k 62) 
nt) <8 4 
ony -y —) Vege a y Opell. os Say BAY, Yaw 
k a=1 k a=l a=| 
= G4 (G02, 07). (22) 


The survey weighted estimates of $\beta$, $\sigma_e^2$, $\sigma_v^2$ are obtained 
simultaneously by following iterative updating steps; $t$ in 
the equations above stands for the $t$-th iteration. Since the 
variance components $\sigma_e^2$ and $\sigma_v^2$ are unknown, initial 
estimates for the iterative steps are generated by Henderson’s 
method. Again, as for Pseudo-EBLUP, for the ELL 
regression model formulation (7), the subscript $a$ is replaced 
by $b$. 

This approach is similar to the probability-weighted 
iterative generalized least squares (PWIGLS) method 
proposed by Pfeffermann et al. (1998) for fitting multilevel 
models, where the estimation process takes into account the 
unequal selection probabilities at each stage of sampling and 
involves iterating between the parameter $\beta$ and the variance 
components until convergence. A model-based approach is 
also proposed by Pfeffermann, Moura and Silva (2006), 
which involves deriving the hierarchical model for given 
sample data as a function of the population model and the 
selection probabilities, and then fitting the sample model 
using a Bayesian approach by means of a Markov chain Monte 
Carlo algorithm. 


5.3 General survey regression method 


Another approach to generate the estimator of the 
parameter $\beta$ and its variance is the design-based 
methodology for fitting regression models (Lohr 1999). This 
technique is currently used in the Stata, Sudaan, and 
WesVar packages, for example. The estimator of $\beta$ given 
below is the sample weighted regression estimator for a 
model with homoscedastic variance structure and 
uncorrelated observations in the population. 


$$\hat{\beta}_{GSR} = (X'WX)^{-1}X'Wy. \qquad (23)$$


This estimator is not derived under the model specified 
by (7) even under homoscedastic variances for the 
household errors. The linearized/robust variance estimate for $\hat{\beta}_{GSR}$ 
is based on the design-based variance estimator for a total, 
given as 


$$V(\hat{\beta}_{GSR}) = D\left[\frac{m}{m-1}\sum_{b=1}^{m}\left(\sum_{h=1}^{n_b} w_{bh} d_{bh}\right)\left(\sum_{h=1}^{n_b} w_{bh} d_{bh}\right)'\right] D \qquad (24)$$


where $d_{bh} = \hat{e}_{bh} x_{bh}$; $\hat{e}_{bh}$ is the residual from WLS 
regression; $x_{bh}$ is a vector of the independent variables; $w_{bh}$ is a 
sampling weight; $D = (X'WX)^{-1}$; and $W$ is a diagonal 
matrix of the sampling weights. 

The General Survey Regression (GSR) method differs from the 
other techniques in the computation of the estimates, and 
generates the estimates without computing the variance 
components, $\sigma_v^2$ and $\sigma_e^2$. As shown above, the equations 
for the estimator of the parameter $\beta$ and its corresponding 
estimated covariance matrix only involve the sampling 
weights matrix $W$. The estimated covariance matrix in (24) 
is often referred to as a sandwich estimator. 
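
A minimal Python sketch of this kind of fit is given below: weighted least squares as in (23), followed by a cluster-level sandwich variance in the spirit of (24). The function name `gsr_fit` and the toy data are hypothetical, stratification is ignored, and the finite-sample factor $m/(m-1)$ is an assumption of the sketch; survey packages differ in such details.

```python
import numpy as np

def gsr_fit(X, y, w, cluster):
    """Survey-weighted least squares with a cluster-level sandwich
    variance. Single stratum assumed; m/(m-1) factor is an assumption."""
    XtWX = X.T @ (w[:, None] * X)
    D = np.linalg.inv(XtWX)
    beta = D @ (X.T @ (w * y))
    resid = y - X @ beta
    labels = np.unique(cluster)
    m = len(labels)
    meat = np.zeros_like(XtWX)
    for b in labels:
        idx = cluster == b
        # Cluster total of w_bh * d_bh, where d_bh = e_bh * x_bh.
        t_b = (w[idx] * resid[idx]) @ X[idx]
        meat += np.outer(t_b, t_b)
    V = D @ (m / (m - 1) * meat) @ D
    return beta, V

# Hypothetical toy data: three clusters, intercept plus one covariate.
X = np.array([[1, 0.5], [1, 1.5], [1, 2.0], [1, 0.0],
              [1, 3.0], [1, 1.0], [1, 2.5], [1, 4.0]], dtype=float)
cluster = np.array([0, 0, 0, 1, 1, 2, 2, 2])
w = np.array([10.0, 20.0, 30.0, 15.0, 15.0, 5.0, 25.0, 10.0])
y = X @ np.array([1.0, 2.0]) + np.array([0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.2, -0.15])

beta_hat, V_hat = gsr_fit(X, y, w, cluster)
print(beta_hat)
print(np.sqrt(np.diag(V_hat)))   # linearized standard errors
```

No variance components appear anywhere in the computation, which is the point made in the text: only the weights and the cluster membership are needed.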


6. Comparison of the model fitting techniques 


The ELL methodology is claimed to be a weighted GLS 
estimation procedure. However, as pointed out earlier, the 
sampling weights are not properly incorporated in the 
estimation process and this leads to non-interpretability of 
the elements in some matrices involved in the estimation, as 
well as asymmetry in the estimated covariance matrix. For 
the ELL method of estimating the variance components, the 
weights are accounted for only at the cluster level. The two 
ways (direct computation and heteroscedasticity model- 
based) that ELL use for generating the household level 
variance component do not incorporate the sampling weights. 
For direct computation, the household level variance compo- 
nent is determined from the residual of the survey-weighted 
(WLS) regression conducted at the preliminary step and the 
weighted estimate of the cluster level component. The 
heteroscedasticity based computation is based on modeling 
the square of the residuals from the WLS regression. 

While the ELL methodology follows a GLS-like 
estimation procedure, the Pseudo-EBLUP and IWEE methods 
follow the Generalized Estimating Equation (GEE) 
procedure (Liang and Zeger 1986) using an exchangeable working 
correlation matrix, i.e., all the off-diagonal elements of the 
correlation matrix within clusters are equal, and in 
Pseudo-EBLUP and IWEE are equal to $\sigma_v^2/(\sigma_v^2 + \sigma_e^2)$. An 
exchangeable or equicorrelated working correlation matrix is 
one of the common working correlation matrices presented 
in the paper of Horton and Lipsitz (1999) when reviewing 
different software for fitting GEE regression models. 
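
To make the exchangeable structure concrete, the short sketch below computes the common within-cluster correlation and builds the corresponding working correlation matrix. The function names are hypothetical; the variance components plugged in are the national-level Pseudo-EBLUP estimates reported in Table 1, and the cluster size of three is an arbitrary choice for display.

```python
def exchangeable_correlation(sigma2_v, sigma2_e):
    """Common within-cluster correlation implied by the nested error model."""
    return sigma2_v / (sigma2_v + sigma2_e)

def working_correlation(n_b, rho):
    """n_b x n_b exchangeable matrix: ones on the diagonal, rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(n_b)] for i in range(n_b)]

# Pseudo-EBLUP estimates from Table 1: cluster level 0.05172, household level 0.18820.
rho = exchangeable_correlation(sigma2_v=0.05172, sigma2_e=0.18820)
R = working_correlation(3, rho)
for row in R:
    print([round(v, 4) for v in row])
```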

The two procedures, Pseudo-EBLUP and IWEE, both 
incorporate the sampling weights in the estimation of the 
parameter $\beta$ and the corresponding standard error, although 
the Pseudo-EBLUP method uses Henderson’s method in the 
estimation of the variance components. While Henderson’s 
method generates unweighted estimates of the variance 
components, the IWEE method incorporates the sampling 
weights iteratively, from the estimation of the variance 
components through to the computation of the standard error of 
the estimate of the regression parameter. 

There is very limited published literature on the 
application of the Pseudo-EBLUP and IWEE methods to real data 
sets. The applications that do exist treat the clusters as the 
small areas, and often use the data in Battese, Harter and 
Fuller (1988), whose data set contains information on 
hectares of corn and soybeans per segment for counties in 
North Central Iowa and assumes simple random sampling 
within areas or clusters. An exception is the recent paper by 
Militino, Ugarte, Goicoa and Gonzalez-Audicana (2006), 
which applies Pseudo-EBLUP to estimating the total area 
occupied by olive trees in Navarra, Spain, where (as in 
Battese et al.) the units are self weighting. Generally for 
poverty estimation, Pseudo-EBLUP and IWEE techniques 
must be applied in more complex situations, since sampling 
clusters and small areas are not identical and the sample is 
not self weighting. In the example in the next section, the 
clusters (barangay) are different from the small areas 
(municipalities), the clusters are sub-units of the small area 
and the sampling scheme is not self weighting. 

The GSR method is one of the least complicated esti- 
mation procedures as it employs a weighted least squares 
procedure using the sandwich estimator for estimating the 
variance of the estimator of the regression parameter. As 
mentioned earlier, this method differs from the other tech- 
niques in that the estimate of the regression parameters and 
their corresponding standard errors are generated without 
computing the variance components. 

Based on the discussion above, for all the techniques 
considered, the survey-based estimation procedures for the 
parameter $\beta$ and its corresponding standard error are 
theoretically sound given their assumptions, except for the 
ELL method, where there are some inconsistencies in the 
estimation of the parameter $\beta$ and the covariance of $\hat{\beta}$. 


7. Application to real data 


In this section, the four different regression techniques 
(one of which contains two variants of ELL) are compared 
using the Philippine 2000 Family Income and Expenditure 
Survey (FIES). The FIES is a nationwide survey 
undertaken by the Philippines National Statistics Office 
(NSO) every three years. The survey gathers details on 
family income and expenditure as well as information 
affecting income and expenditure. Selected households are 
interviewed in two separate operations, each covering a 
half-year period, in order to allow for seasonal patterns in 
income and expenditure. For FIES 2000 the interviews were 
conducted in July 2000, for the period 01 January to 30 June 
and January 2001 for the period 01 July to 31 December. 
The sample design for FIES used a multi-stage stratified 
random sampling technique. Barangays are the primary 
sampling units (PSUs) and are stratified into urban and rural 
within each province and selected using systematic sam- 
pling with probability proportional to size. Large barangays 
are further divided into enumeration areas and subjected to 
further sampling before the final stage in which households 
are systematically sampled from the 1995 Population 
Census List of Households. Interview non-response was 
only 3.4 percent, with 39,615 of the sample households 
being successfully interviewed in both survey visits. 
Deterministic imputation was done to address item non- 
response, i.e., entry for a particular missing item is deduced 
from other items in the questionnaire. 

The auxiliary variables used in this paper are adopted 
from the variables included in the model formulated by 
Haslett and Jones (2005) that was fitted without using 
POVMAP for the small area poverty mapping project in the 
Philippines. The auxiliary variables included both 
household characteristics and municipal means (in which the 
household data used have the same value for every sampled 
household in a given municipality, i.e., small area). These 
auxiliary variables are not only derived from the FIES data 
but also from the Philippine 2000 Labor Force Survey 
(LFS) and Census of Population and Housing (CPH). The 
LFS collects socioeconomic characteristics of the popu- 
lation over 15 years old. It is conducted on a quarterly basis 
by the NSO by personal interview, using previous week as 
reference period. Being part of the Integrated Survey of 
Households (NSCB 2000), the July 2000 and January 2001 
surveys used the same sample of households as the 2000 
FIES. Thus the two data sets can be merged to form a richer 
set of auxiliary variables. Additional auxiliary variables 
were also taken from the 2000 CPH in the form of 
municipal means. Census variables in both the short and 
long form were averaged at municipal level to create new 
data sets that could be merged with the set of auxiliary 
variables from FIES and LFS. 

Presented in Tables 1, 2, and 3 are the computed 
estimates of the parameter ($\beta$) and the corresponding standard 
errors as well as the estimates of the variance components at 
the national, regional and provincial levels, respectively. 
Table 2 is one of the regional models of the 16 models fitted 
at the regional level (there are 16 regions in the Philippines 
in the year 2000). Similarly, Table 3 shows one of the 
provincial models of the 20 models formulated for 20 
selected provinces. To standardize the comparison, exactly the 
same set of predictor variables is used for all the different 
model fitting techniques. (There are five sets of parameter 
estimates, although there are only four basic methods 
considered, because ELL is used both with and without 
heteroscedasticity.) Note that in practice when ELL is 
applied, the survey data is often subdivided and separate 
models fitted to each subsample, e.g., to each regionally- 
based stratum as the 16 regions in the Philippines or even 
provincial level models. This can lead to overfitted models 
and downwardly biased standard errors for small area esti- 
mates. For the analysis here, a single model (or the national 
level model) has been fitted. In practice intermediate models 
with some but not all possible regional effects seem to work 
best. See for example Haslett and Jones (2005). 

To assess the differences of the estimates generated from 
the different techniques, an informal comparison of the 
“significance” of the different estimates of $\beta$ is conducted 
by subtracting from the estimate by one method the mean of 
the other methods’ estimates, then dividing by the standard 
error of the one method. At the national level (Table 1), 
estimates of the regression coefficients generated from the 
different methods are significantly different from each other 
for a number of the independent variables. GSR tends to 
generate estimates of the regression coefficients for the 
majority of the variables that are significantly different from 
the other methods. As pointed out earlier, the GSR estimator 
is the sample weighted regression estimator for a model 
with homoscedastic variance structure and uncorrelated 
observations in the population and hence this estimator is 
not derived under the model specified by (7). However, it is 
the most conservative as it generates the highest standard 
error for all the household level characteristics. On the other 
hand, the IWEE method has the highest estimated standard 
error for all the municipal means. The ELL_H (ELL with 
heteroscedasticity) method can be considered to be the least 
conservative since it produces the lowest standard errors for 
all the estimated regression coefficients of the household 
level characteristics as well as for the municipal means, 
except for two variables where GSR generated the smallest 
estimates. As to the estimates of the variance components, 
the ELL method generates the smallest estimated cluster 
level variance, which is about 92% of the Pseudo-EBLUP 
estimate and 86% of the IWEE estimate. As to the household 
level variance, the IWEE method generates the smallest 
estimate. 
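
The informal comparison and the variance ratios quoted above can be reproduced from Table 1. The sketch below does so for the `type_mult` coefficient; the function name `informal_z` is hypothetical, and treating the two ELL variants as separate methods when averaging "the other methods" is an assumption about how the comparison was carried out.

```python
def informal_z(estimates, std_errors, method):
    """(estimate by one method - mean of the other methods' estimates)
    divided by the standard error of that one method."""
    others = [v for k, v in estimates.items() if k != method]
    return (estimates[method] - sum(others) / len(others)) / std_errors[method]

# type_mult row of Table 1 (national level).
beta = {"ELL": 0.03876, "ELL_H": 0.03703, "Pseudo-EBLUP": 0.03699,
        "IWEE": 0.03466, "GSR": 0.11460}
se = {"ELL": 0.01697, "ELL_H": 0.01588, "Pseudo-EBLUP": 0.01717,
      "IWEE": 0.01692, "GSR": 0.02194}
z_gsr = informal_z(beta, se, "GSR")

# Cluster level variance estimates from Table 1.
ratio_pseudo = 0.04741 / 0.05172   # ELL relative to Pseudo-EBLUP
ratio_iwee = 0.04741 / 0.05498     # ELL relative to IWEE

print(round(z_gsr, 2), round(ratio_pseudo, 2), round(ratio_iwee, 2))
```

For this coefficient the GSR estimate lies more than three of its own standard errors away from the mean of the other four, which is the kind of difference the text describes as significant.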

Table 1 
National level estimates of regression parameters with their standard errors, and the variance components, for the four techniques. 

Explanatory        ELL (no hetero)     ELL (w/ hetero)        Pseudo-EBLUP                IWEE                 GSR
variable            Beta Std. Err.      Beta Std. Err.      Beta Std. Err.      Beta Std. Err.      Beta Std. Err.
famsize         -0.11867   0.00181  -0.12034   0.00165  -0.11875   0.00183  -0.11888   0.00180  -0.11405   0.00216
famsizesqc       0.00937   0.00039   0.00981   0.00036   0.00938   0.00039   0.00939   0.00038   0.00898   0.00044
type_mult        0.03876   0.01697   0.03703   0.01588   0.03699   0.01717   0.03466   0.01692   0.11460   0.02194
per_kids        -0.20342   0.01476  -0.20818   0.01322  -0.20293   0.01491  -0.20216   0.01467  -0.22864   0.01617
roof_light      -0.06314   0.01291  -0.05808   0.01056  -0.06263   0.01306  -0.06175   0.01287  -0.09251   0.01413
per_61up        -0.09402   0.01420  -0.08331   0.01371  -0.09392   0.01435  -0.09389   0.01412  -0.09705   0.01698
roof_strong      0.05882   0.01135   0.05633   0.00962   0.05944   0.01148   0.06030   0.01132   0.03118   0.01293
wall_light      -0.05459   0.01182  -0.04979   0.00975  -0.05426   0.01195  -0.05392   0.01178  -0.06286   0.01353
wall_salvaged   -0.10814   0.02505  -0.11327   0.02058  -0.10748   0.02533  -0.10607   0.02495  -0.15702   0.02925
wall_strong      0.14248   0.01051   0.12964   0.00910   0.14274   0.01063   0.14319   0.01047   0.12662   0.01284
fa_xs           -0.17052   0.00941  -0.16756   0.00782  -0.17144   0.00952  -0.17236   0.00939  -0.14213   0.01110
fa_s            -0.08368   0.00861  -0.08242   0.00725  -0.08403   0.00871  -0.08454   0.00857  -0.06667   0.00964
fa_l             0.09016   0.00908   0.08478   0.00792   0.09065   0.00918   0.09106   0.00904   0.07848   0.01047
fa_xl            0.16959   0.01104   0.15404   0.00992   0.17034   0.01117   0.17121   0.01100   0.14300   0.01334
fa_xxl           0.27072   0.01144   0.24485   0.01094   0.27172   0.01157   0.27274   0.01140   0.23913   0.01457
fa_xxxl          0.36190   0.01371   0.31369   0.01286   0.36270   0.01387   0.36382   0.01367   0.32123   0.02025
all_eled         0.19084   0.01535   0.20497   0.01307   0.19031   0.01551   0.18964   0.01527   0.21344   0.01831
all_hsed         0.42325   0.01250   0.43771   0.01083   0.42192   0.01263   0.42024   0.01244   0.48180   0.01475
all_coed         1.22591   0.01371   1.29368   0.01379   1.21324   0.01386   1.20935   0.01366   1.35022   0.01827
dom_help         0.60207   0.01629   0.61218   0.01886   0.60035   0.01645   0.59733   0.01620   0.70307   0.02656
head_male       -0.05878   0.00988  -0.04581   0.00932  -0.05862   0.00998  -0.05819   0.00982  -0.07410   0.01173
no_spouse       -0.09367   0.00987  -0.07376   0.00917  -0.09361   0.00997  -0.09351   0.00981  -0.09599   0.01123
hou_9600         0.28537   0.07654   0.25643   0.07375   0.28871   0.07911   0.28783   0.08066   0.31956   0.07941
hea_rel_mus      0.09058   0.02645   0.10859   0.02507   0.09753   0.02728   0.09731   0.02782   0.10196   0.02737
Per_eng          0.17273   0.06529   0.14561   0.06298   0.17782   0.06754   0.17799   0.06887   0.17076   0.06407
Hou_coelpg       0.37463   0.04348   0.39784   0.04210   0.37934   0.04494   0.37792   0.04581   0.42682   0.03711
Hou_own_ref      0.17716   0.10497   0.18342   0.10178   0.17189   0.10843   0.17329   0.11055   0.13791   0.09766
Hou_own_tel      1.39287   0.13356   1.42109   0.12987   1.38551   0.13723   1.38974   0.13989   1.23506   0.13019
Per_wor_prh      0.46957   0.15484   0.40302   0.14926   0.47517   0.16006   0.47208   0.16317   0.50814   0.15210
Per_ind_52      -0.76245   0.21708  -0.78120   0.21073  -0.76326   0.22410  -0.76307   0.22849  -0.73294   0.21214
const            9.54013   0.05525   9.54456   0.05290   9.53566   0.05698   9.53594   0.05791   9.52622   0.05613

Variance component estimates
HH level                   0.18461                 NA*             0.18820             0.18185           0.18461**
Cluster level              0.04741             0.04741             0.05172             0.05498           0.04741**

* Different value for each household (mean = 0.1576633). ** Based on the ELL results. 

Table 2 


Regional level estimates of regression parameters with the standard errors and the variance components for the four techniques. 
*Different value for each household (mean = 0.18930) **Based from the ELL results 


Explanatory ELL(no hetero) ELL(w/ hetero) Pseudo-EBLUP IWEE GSR 
Variables Beta Std. Error Beta Std. Error Beta Std. Error Beta Std. Error Beta Std. Error 
famsize -0.12327 0.00760 -0.12934 0.00689 -0.12377 0.00752 -0.12380 0.00749 -0.11786 0.00997 
famsizesqc 0.01096 0.00164 0.01190 0.00147 0.01101 0.00163 0.01102 0.00162 0.01030 0.00195 
dom_help 0.81037 0.08873 0.75624 0.10986 0.80727 0.08784 0.80708 0.08751 0.84490 0.08911 
wall light -0.06808 0.04289 -0.06390 0.03743 -0.06020 0.04272 -0.05973 0.04257 -0.14472 0.04226 
wall_ strong 0.13761 0.03745 0.15212 0.03469 0.14514 0.03737 0.14560 0.03725 0.06116 0.04249 
fa_xs -0.22074 0.04910 -0.22368 0.04518 -0.22723 0.04875 -0.22761 0.04858 -0.14856 0.05665 
fa_s -0.13540 0.03840 -0.12255 0.03344 -0.13775 0.03805 -0.13789 0.03791 -0.11059 0.04538 
fa_| 0.09484 0.03709 0.08894 0.03429 0.09590 0.03676 0.09597 0.03663 0.08529 0.04122 
fa_xl 0.16627 0.04315 0.15519 0.04072 0.16938 0.04284 0.16958 0.04269 0.13698 0.04897 
fa_xxl 0.33706 0.04545 0.31196 0.04829 0.34173 0.04516 0.34201 0.04500 0.29156 0.05148 
fa_xxxl 0.33103 0.06185 0.30377 0.06029 0.33762 0.06134 0.33801 0.06111 0.26052 0.06635 
all hsed 0.33987 0.05253 0.35591 0.04783 0.33807 0.05209 0.33796 0.05189 0.35776 0.04843 
all_coed 1.21824 0.05734 1.24762 0.05842 1.20787 0.05692 1.20726 0.05671 1.32979 0.06227 
per_kids -0.24699 0.06440 -0.24047 0.05846 -0.24439 0.06371 -0.24424 0.06347 -0.27423 0.07050 
per_6lup -0.14609 0.06126 -0.15938 0.05787 -0.14703 0.06063 -0.14708 0.06040 -0.13525 0.07124 
hou_9600 1.13985 0.49103 1.27035 0.47888 1.14320 0.52137 1.14357 0.52172 1.07509 0.51937 
Hou_own_ref 1.45233 0.24550 1.51020 0.23864 1.44986 0.26072 1.44985 0.26089 1.44779 0.23585 
const 9.36877 0.20322 9.32363 0.19660 9.36597 0.21502 9.36569 0.21512 9.41385 0.21430 
Variance 
Components HH Cluster HH Cluster HH Cluster HH Cluster HH** Cluster** 
Estimate level level level level level level level level level level 
0.19544 0.03073 NA* 0.03073 0.19052 0.03728 0.18902 0.03748 0.19544 0.03073 
Table 3
Provincial level estimates of regression parameters with the standard errors and the variance components for the four techniques.
*Different value for each household (mean = 0.23749) **Based on the ELL results


Explanatory ELL(no hetero) ELL(w/ hetero) Pseudo-EBLUP IWEE GSR 
Variables Beta Std. Error Beta Std. Error Beta Std. Error Beta Std. Error Beta Std. Error 
famsize -0.1450 0.0175 -0.1489 0.0156 -0.1452 0.0179 -0.1449 0.0171 -0.1413 0.0097 
famsizesqc 0.0090 0.0063 0.0124 0.0067 0.0091 0.0065 0.0090 0.0062 0.0085 0.0055 
fa_xs -0.4549 0.1126 -0.3816 0.1010 -0.4552 0.1149 -0.4546 0.1095 -0.4479 0.0718 
fa_s -0.2550 0.0976 -0.2653 0.0794 -0.2545 0.0995 -0.2555 0.0951 -0.2693 0.1198
wall_light -0.2055 0.0945 -0.1474 0.0778 -0.2057 0.0965 -0.2058 0.0919 -0.2063 0.1070
all_hsed 0.4007 0.1643 0.3531 0.1448 0.4015 0.1673 0.4006 0.1601 0.3891 0.1585 
all_coed 1.5411 0.1677 1.8202 0.1769 1.5429 0.1709 1.5429 0.1635 1.5439 0.2326 
Hou_own_tel 3.4373 1.0270 3.2630 1.0582 3.4265 1.0622 3.4274 0.9871 3.4392 0.5733 
Per_wor_prh -1.1075 1.1933 -1.5801 1.2008 -1.1049 23277, -1.1056 1.1483 -1.1150 0.8729 
const 10.0976 0.1480 10.0798 0.1279 10.0988 0.1517 10.0981 0.1435 10.0872 0.1373 
Variance 
Components HH Cluster HH Cluster HH Cluster HH Cluster HH** Cluster**
Estimate level level level level level level level level level level 
0.25753 0.01871 NA* 0.25753 0.26682 0.02079 0.24498 0.01671 0.25753 0.01871 


At the regional level, estimates of the regression coeffi- 
cients are generally similar for all the different estimation 
methods, except that the GSR and/or ELL_H methods gen- 
erated estimates for a few variables which were significantly 
different from the other methods. Similar to the national 
level estimated standard errors, GSR also tends to be the 
most conservative method for the majority of the regional 
level models - it generated the highest estimated standard 
errors for most of the regression coefficients of the house- 
hold characteristics. IWEE has the highest estimated 
standard error for most of the coefficients of the municipal 
means. The ELL_H method produces the lowest standard 
errors for the majority of the regression coefficients of the 
household characteristics and municipal means. The ELL 
method tends to generate the smallest estimated cluster level 
variance with ratios to Pseudo-EBLUP and IWEE ranging 
from around 82% to 100%. The IWEE method still has the 
smallest household level variance. 

Similar to the regional level estimates, the regression 
coefficients’ estimates at the provincial level are similar 
except for some discrepancies from the GSR and ELL_H 
estimates. For the estimated standard errors of the regression 
coefficients, the ELL_H still produces the lowest estimates 
for the majority of the coefficients of the household 
characteristics; however, the GSR method (instead of the 
ELL_H method) now produces the lowest estimated 
standard error for the majority of the municipal means. The 
ELL method still tends to generate the smallest estimated 
cluster level variance for most provinces with the smallest 
ratio to Pseudo-EBLUP about 53% and to IWEE about 
48%. For a number of provinces, IWEE tends to generate 
the smallest estimated cluster level variance. For the 
household level variance, IWEE still generated the smallest 
estimate. Generally, estimates of the cluster level variance 
tend to be more variable at the provincial level which is due 
to smaller sample sizes. 

For small area estimates of poverty, after the regression 
model is applied to census data, estimated standard errors in 


the regression are only one part of the small area estimates’ 
standard errors. There is also variation at the cluster level in 
(7) that needs to be considered (to different degrees 
depending on the level of aggregation used to construct the 
small areas) and there is variation at household level too. 
These additional sources of variation can be assessed via the 
estimated variance components. As shown above, regardless 
of the level (national, regional and provincial) at which the 
model is formulated, the IWEE method generates the
smallest household level variance, while the ELL method 
generates the smallest cluster level variance. Since the 
cluster level variation usually makes a much larger 
contribution to the estimated standard error at the small area 
level, ELL is again the least conservative. We note that the 
household level variance under the ELL method with 
heteroscedasticity model varies from one unit to another, 
hence, the mean value is reported, and that the estimated
$R^2$ for the heteroscedasticity model is negligible ($R^2 =
0.03$ even at the national level), so that, in terms of regression
model fit at least, it may offer few advantages for this data
set. In our experience with applying the ELL method we
have found that heteroscedasticity modeling is unnecessary.

Returning to the regression (i.e., the estimates generated
for $\beta$ and the estimated standard errors for the different
techniques), IWEE is the method that best incorporates the
sampling weights into the computation of the variance
components necessary for the generation of small area
estimates and their estimated standard errors. In terms of
implementation, the GSR method would generally be the
simplest option as it is available, for example, in packages
such as Stata, Sudaan or WesVar. The ELL method
combines sampling weights and covariance structure in a
way that is non-standard in that it uses an estimate of
$W_c V_c^{-1}$ in (8) and (9) to produce an asymmetric estimated
covariance matrix for the estimates of $\beta$ and for estimating
$\beta$ itself. For estimating $\beta$ this would be acceptable if the
asymmetric matrix were a generalized inverse of the correct
covariance matrix. It is however clearly not acceptable as an
estimated covariance matrix, a problem ELL attempt to
circumvent (e.g., in the World Bank’s POVMAP software)
by averaging each of the relevant pairs of off-diagonal
elements to meet the necessary condition that a covariance
matrix be symmetric.

Generally in the ELL method of poverty estimation only 
variables matching in terms of average and standard
deviation in both survey and census plus census averages 
can be used. This is because, after the regression model has 
been fitted to the survey data, in the second phase it is 
applied to the census data as a predictor at household level, 
i.e., the regression equation (however it has been estimated) 
is used to find predicted values of per capita income or 
expenditure for each census household, generated via 


$\hat{Y}_{ch} = x_{ch}^{T}\hat{\beta} + \hat{v}_c + \hat{e}_{ch}$, (25)


using imputed values of $v_c$ and $e_{ch}$ (based for example on
bootstrap sampling from their survey estimates). Here $x_{ch}$
are auxiliary variables from the census. Poverty indices are
typically based on non-linear functions of log-income or
log-expenditure, so the predictions from (25) are trans-
formed appropriately before averaging over each small area.
Note that in practice $v_c$ can be estimated for the sampled
clusters, but the sample and census codes usually do not
match so these cannot be identified in the census, and it is
the bootstrap (by selecting from the sampled barangays, i.e.,
PSUs) that provides imputed values for all barangays; a
parallel comment applies to $\hat{e}_{ch}$ for households within
clusters. The general benefit of using census data in this way
(as ELL does) is that the predictor variables can be used for
all census households (of which there are many) not just
those in the survey, thereby increasing accuracy of the small
area estimates (conditional on the model being correct).
Note that the estimates in (25) remain unbiased even if $v_c$
and $e_{ch}$ are not included in the prediction itself, but the
variance estimate for small area $a$ needs to be computed
based on equation (25) so that it incorporates the necessary
additional variation at cluster and household levels.
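
The second-phase prediction step described above can be illustrated with a minimal sketch. It is not the ELL or PovMap implementation: the function name, the data layout (a dictionary of census clusters) and the simple resampling of residuals are assumptions made here for illustration, and the regression coefficients are held fixed, whereas a full implementation would also reflect their sampling variability.

```python
import math
import random


def simulate_poverty_incidence(census, beta, cluster_resid, hh_resid,
                               poverty_line, n_draws=200, seed=0):
    """Bootstrap-style sketch of household-level prediction on census data.

    census        : dict mapping a cluster id to a list of covariate vectors
                    (each vector includes the constant 1)
    beta          : fitted regression coefficients (held fixed here)
    cluster_resid : survey-based cluster-level residuals to resample from
    hh_resid      : survey-based household-level residuals to resample from
    Returns the mean and standard deviation, over draws, of the share of
    households whose predicted log expenditure is below log(poverty_line).
    """
    rng = random.Random(seed)
    log_line = math.log(poverty_line)
    n_households = sum(len(rows) for rows in census.values())
    draws = []
    for _ in range(n_draws):
        poor = 0
        for rows in census.values():
            v_c = rng.choice(cluster_resid)    # one imputed effect per census cluster
            for x in rows:
                e_ch = rng.choice(hh_resid)    # one imputed residual per household
                y = sum(b * xi for b, xi in zip(beta, x)) + v_c + e_ch
                poor += y < log_line           # non-linear function of the prediction
        draws.append(poor / n_households)
    mean = sum(draws) / n_draws
    sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / (n_draws - 1))
    return mean, sd
```

The spread of the simulated incidence across draws reflects only the cluster and household level variation, which is the part of the small area standard error that the imputed residuals are meant to capture.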

In poverty estimation, we are interested in area-level
summaries of non-linear functions of $Y_{ch}$, for example,
whether it is below the poverty line (poverty incidence) and
poverty gap, rather than the regression fitting per se. It is
instructive here to examine the effects of model uncertainty
on area mean estimates


$\hat{\bar{Y}}_a = \bar{X}_a^{T}\hat{\beta}$, (26)


where $\bar{X}_a$ is the population (i.e., census) mean for area $a$ of
the covariates including the constant 1, after the regression
model has been applied to the census data as in phase 2 of
ELL. By similarly averaging (7) to get the true mean $\bar{Y}_a$,
subtracting from (26), and applying the variance operator,
we get the prediction error variance equation (27), in which
$N_a$ is the population size at a particular level of
aggregation, $N_c$ is the population size in each cluster, $\Phi_{\beta}$ is
the variance-covariance matrix of the regression coefficient
estimates, and $(\sigma_v^2, \sigma_e^2)$ are the cluster and household level
variance components, respectively. Note that estimating this
prediction error variance requires estimates of the variance
components, but any bias caused by uncertainty in these
would be a second order effect (see Prasad and Rao 1990).
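
As a sketch of what this prediction error variance expresses, one form consistent with the quantities just defined is given below. It assumes that the sampling error in the estimated regression coefficients, the cluster effects and the household errors are mutually independent; the subscripts $v$ and $e$ on the variance components and the symbol $\Phi_{\beta}$ are notation assumed here, and the display is not claimed to reproduce the authors' exact equation.

```latex
\operatorname{Var}\!\left(\hat{\bar{Y}}_a - \bar{Y}_a\right)
  \approx \bar{X}_a^{T}\,\Phi_{\beta}\,\bar{X}_a
  + \frac{\sigma_v^{2}}{N_a^{2}} \sum_{c \in a} N_c^{2}
  + \frac{\sigma_e^{2}}{N_a}
```

The first term does not shrink as the area grows, whereas the second and third terms fall roughly in proportion to the number of clusters and the number of households in the area, respectively.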

Based on (27), the extent of the influence of the survey 
based regression model and other variance components 
(cluster and household level) on the accuracy of the final 
small area estimates can be compared for any fitting 
technique and/or levels of aggregation. Generally, it is either 
the regression model (via the estimate of the regression 
parameters) or the cluster effect that dominates the 
estimated accuracy of the computed small area estimate. 
Using the national level model in Table 1 and the survey
data (instead of the census) auxiliary variables to estimate 
the first term in (27), shows that the extent to which the 
regression model effect contributes to small area estimate 
variance increases markedly as household data are more 
aggregated - about 0.25% at the municipal level, 20% at the 
provincial level and 70% at the regional level. In other 
words, the more aggregated the data into larger areas, the 
greater the dominance of the regression model parameter 
uncertainty, regardless of the regression fitting method. This 
is as expected because even at high levels of aggregation, 
the contribution to the overall variance from the model 
effect depends on the average covariate values, not on the 
population size. This is the reason that, at the most 
aggregated regional level, small area techniques usually 
offer little improvement over direct estimates. This is also 
why it is important (as this paper has done) to examine in 
detail the regression fitting procedures applied in small area 
estimation of third world poverty. 
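
The way the regression component comes to dominate at higher levels of aggregation can be seen in a small numerical sketch. It assumes a three-term decomposition of the prediction error variance (regression, cluster and household terms) under independence of the three error sources; the function names and all numerical values are hypothetical and are not taken from the Philippine data.

```python
def variance_components(xbar, phi, sigma_v2, sigma_e2, cluster_sizes):
    """Return the (regression, cluster, household) terms of the prediction
    error variance of an area mean, under the assumed decomposition.

    xbar          : area mean of the covariates (constant included)
    phi           : covariance matrix of the coefficient estimates
    sigma_v2      : cluster level variance component
    sigma_e2      : household level variance component
    cluster_sizes : number of households in each cluster of the area
    """
    n_a = sum(cluster_sizes)
    p = len(xbar)
    model = sum(xbar[i] * phi[i][j] * xbar[j] for i in range(p) for j in range(p))
    cluster = sigma_v2 * sum(n * n for n in cluster_sizes) / n_a ** 2
    household = sigma_e2 / n_a
    return model, cluster, household


def model_share(xbar, phi, sigma_v2, sigma_e2, cluster_sizes):
    """Share of the total prediction error variance due to the regression term."""
    m, c, h = variance_components(xbar, phi, sigma_v2, sigma_e2, cluster_sizes)
    return m / (m + c + h)


# Hypothetical inputs: the same model applied to a small and a large area.
xbar = [1.0, 5.0]
phi = [[1e-4, 0.0], [0.0, 1e-6]]
small_area = model_share(xbar, phi, 0.05, 0.19, [100] * 10)     # 10 clusters
large_area = model_share(xbar, phi, 0.05, 0.19, [100] * 1000)   # 1,000 clusters
```

With these inputs the regression term is identical for both areas, while the cluster and household terms are a hundred times smaller for the larger area, so `large_area` exceeds `small_area`.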

The effect of cluster level variation is different: at lower
levels of aggregation (e.g., municipality) the computed
variance of the small area estimates is dominated by the
cluster component of variance or cluster level effect, i.e., for
small areas (other than regional estimates) the variance
component, not the regression model, has the greatest
impact on the value of the standard error of the small area
estimates. Consequently, the accuracy of estimates of variance
components, especially at cluster level, can be crucial to
accurate estimation of the standard error of small area estimates
at the aggregation level at which they are most useful (for
example at municipal level in the Philippines). Again, this is
why the methods used for phase 1 fitting of variance
components, as discussed in this paper, are critical to small
area estimation of poverty.
Presented in Tables 4-6 are Kruskal-Wallis (KW) tests
(Siegel 1956) for the various fitting methods conducted on
the estimated variances at the municipal (Table 4), provin-
cial (Table 5) and regional (Table 6) levels. In Table 4
significant differences exist among the variance estimates 
generated by the various small area techniques, as shown by 
the p-values of the Kruskal-Wallis statistics. Multiple 
comparison of mean ranks shows the Pseudo-EBLUP and 
IWEE methods have variance estimates at cluster level that 
are significantly higher than the other methods, but not 
significantly different from each other (although for the 
IWEE method the Z-value for the difference from average 
rank is in general rather higher than all the others). 
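
For readers who wish to reproduce this kind of comparison, the following is a minimal sketch of the Kruskal-Wallis statistic together with a Z-value for each group's mean rank. The Z-value convention used (mean rank minus the overall mean rank, divided by its standard deviation under the null hypothesis) is an assumption made here, as is the omission of a tie correction; the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank the values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, mean ranks, Z-values) for a list of groups of observations.

    H is computed without a tie correction.
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    overall = (n + 1) / 2                  # overall mean rank
    h = 0.0
    mean_ranks, z_values = [], []
    start = 0
    for g in groups:
        r = ranks[start:start + len(g)]
        start += len(g)
        rbar = sum(r) / len(g)
        mean_ranks.append(rbar)
        h += len(g) * (rbar - overall) ** 2
        z_values.append((rbar - overall) / math.sqrt((n + 1) * (n / len(g) - 1) / 12))
    h *= 12 / (n * (n + 1))
    return h, mean_ranks, z_values
```

Under this convention, five groups of 1,243 values have an overall mean rank of (6,215 + 1)/2 = 3,108, and a group mean rank of 2,961.2 corresponds to a Z-value of roughly -3.2.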

The ELL method and the GSR method generate signifi-
cantly lower and similar variance component estimates. This
is principally because we used the ELL variance compo- 
nents estimation technique in generating variance compo- 
nents for the GSR method (because GSR does not usually 
estimate variance components), although the residuals we 
used were not identical for the two regression fitting 
methods. As expected, at the municipal level for which 
small area estimates were used in practice, the cluster effect 
(rather than regression coefficient uncertainty) is generally 
the dominant part of the small area variance estimates. Since 
the ELL and GSR methods have similar cluster level 
variance, their corresponding variance estimates at small 
area also tend to be similar. Explicitly, observe from Table 4 
that the ranking of the variance estimates generally 
conforms with the ranking of the cluster effects. 

In poverty estimation, estimates at higher levels of aggre-
gation, such as those in Tables 5 and 6, are generally carried
out for comparison with direct survey estimates at these
more aggregated levels, even though they are not particu-
larly useful for aid allocation. The results do, however,
support those indicated for lower levels of aggregation. In
Table 5 and Table 6, the estimated variances for the poverty 
estimates generated by the different techniques are not 
significantly different from each other at the provincial and 
regional level, an effect that is partially due to the small 
number of provinces and even smaller number of regions. 
The variances and hence the standard errors may not be 
significantly different from each other, but it is worth noting 
that the GSR method tends to generate the smallest esti- 
mated standard error for the regression model and in turn 
the smallest variance estimate for poverty at the regional 
level, even though GSR generates higher standard errors for 
the individual regression coefficients (corresponding to the 
diagonal elements only in the estimated covariance matrix 
of $\hat{\beta}$). As expected, at an even higher level of aggregation
for all methods, the relative effect of the regression compo- 
nent is more pronounced. 
The general conclusion is that, whether fitting survey
data alone or using survey based regression parameter 
estimates in conjunction with census data, it is crucial not 
only to find a suitable model (i.e., set of regressors) based 
on an adequate sample size, but also to get sound estimates 
of the regression parameters and their standard errors under 
this model as well as good estimates of the variance 
components at all relevant levels of aggregation. Usually the 
relevant levels of aggregation are determined via the survey 
design, rather than simply through the level at which small 
area estimates are sought, although the number of levels 
need not be limited to two (e.g., to cluster-level and 
household-level). 

Survey data, whether used for poverty estimation or in
other contexts, also introduce problems involving survey
weights that can be important not only for regression 
parameter estimation (and their estimated standard errors) 
but also for estimating variance components. Incorporating 
survey weights into regression models with correlated data 
introduces problems because it is the population correlation 
as it applies to the weighted survey data that needs to be 
properly modeled, so that weighting correlation matrices 
using matrix multiplication (as ELL do) is not technically 
adequate (see Appendix). 

For the Philippine data and for the specified list of 
regressors, regardless of which of the four methods is used,
parameter estimates were very similar, which suggests that 
the more important issue is possible underestimation of 
standard errors of parameter estimates and of variance 
components particularly at cluster level. ELL is the least 
conservative in that it gave the lowest estimates of both 
variance measures, and in this respect (as with its use of 
asymmetric estimated covariance matrices) some caution 
may be warranted with the regression and variance compo- 
nent aspects of the ELL technique. GSR gave similar esti- 
mates of standard errors for the small area estimates to ELL 
when using the same technique for variance components, 
despite having higher standard errors (and using a sound 
covariance matrix) for regression parameters. This is be- 
cause when there is less aggregation, the level at which most 
small area estimates are actually used, variance components 
dominate. 

The Pseudo-EBLUP and IWEE methods incorporate 
survey weights correctly (given a suitable choice of pseudo- 
likelihood and hence GEE) and gave larger (i.e., more 
conservative) estimates of cluster level variance components. 
This suggests that these two methods and particularly IWEE
are among the best of the currently available methods, not 
necessarily for estimating regression equations (where avail- 
ability of standard software may give GSR an advantage), 
but for estimating the crucial variance components. 
Table 4

Kruskal-Wallis test for estimated variances at the municipal level (N = 1,243)

SAE Cluster Effect Beta Effect Variance

Techniques Median Mean Rank Z Median Mean Rank Z Median Mean Rank Z
ELL(no hetero) 0.002843 2,961.2(a) -3.22 0.0002311 3,067.3(ab) -0.89 0.00318 2,963.4(a) -3.18
ELL(w/ hetero) 0.002843 2,961.2(a) -3.22 0.0002128 2,802.0(c) -6.72 0.00316 2,930.8(a) -3.89
Pseudo-EBLUP 0.003094 3,229.4(b) 2.67 0.0002449 3,257.5(ad) 3.28 0.00346 3,241.3(b) 2.93
IWEE 0.003294 3,426.9(b) 7.01 0.0002529 3,364.5(d) 5.64 0.00366 3,441.3(b) 7.32
GSR(Stata) 0.002843 2,961.2(a) -3.22 0.0002311 3,048.7(b) -1.30 0.00317 2,963.1(a) -3.18
Overall 3,108 3,108 3,108

KW Statistic H = 69.92 (P = 0.000) H = 72.19 (P = 0.000) H = 78.06 (P = 0.000)

Table 5

Kruskal-Wallis test for estimated variances at the provincial level (N = 83)

SAE Cluster Effect Beta Effect Variance

Techniques Median Mean Rank Z Median Mean Rank Z Median Mean Rank Z
ELL(no hetero) 0.0002518 200.3 -0.65 0.0001162 207.7 -0.03 0.00039 202.3 -0.48
ELL(w/ hetero) 0.0002518 200.3 -0.65 0.0001095 190.1 -1.52 0.00038 196.3 -0.99
Pseudo-EBLUP 0.000274 214.9 0.59 0.0001239 224.2 1.37 0.00042 217.2 0.78
IWEE 0.0002916 224.2 1.38 0.0001287 234.1 2.22 0.00045 227.8 1.68
GSR(Stata) 0.0002517 200.3 -0.65 0.00010 184 -2.04 0.00037 196.4 -0.98
Overall 208 208 208

KW Statistic H = 2.82 (P = 0.589) H = 10.61 (P = 0.031) H = 4.48 (P = 0.344)

Table 6

Kruskal-Wallis test for estimated variances at the regional level (N = 16)

SAE Cluster Effect Beta Effect Variance

Techniques Median Mean Rank Z Median Mean Rank Z Median Mean Rank Z
ELL(no hetero) 0.000050 38.2 -0.45 0.000077 40.9 0.08 0.00013 39.3 -0.23
ELL(w/ hetero) 0.000050 38.2 -0.45 0.000073 35.1 -1.05 0.00012 37 -0.67
Pseudo-EBLUP 0.000055 42.6 0.4 0.000082 46.9 1.23 0.00014 44 0.67
IWEE 0.000058 45.3 0.93 0.000085 50.1 1.85 0.00015 46.6 1.17
GSR(Stata) 0.000050 38.2 -0.45 0.000070 29.6 -2.1 0.00013 35.6 -0.94
Overall 40.5 40.5 40.5

KW Statistic H = 1.30 (P = 0.861) H = 8.36 (P = 0.079) H = 2.58 (P = 0.630)


Of course, such considerations (while central) need to be 
predicated by adequate data cleaning, sound matching of 
possible regressor variables (in terms of mean, variance, and 
meaning) between survey and census where census data is 
also being used. Also needed are the proper, time con- 
suming consideration of a wide range of possible regressor 
variables and recognition of the limits placed on subdividing 
survey data by small sample sizes, since all estimated 
standard errors for both regression parameter and small area 
estimates (whatever method is used for fitting the variance 
component estimate) are conditional on the regression 
model being correct. 


8. Conclusion and recommendation 


There is a great need for sound poverty statistics in order 
to effectively monitor interventions and assistance to 
various impoverished localities. Small area estimation tech- 
niques are one methodology that is being used to provide 
such statistics. In this sense the issues raised in this paper 
concerning the accuracy of the small area estimates are not 
simply an academic issue but are central to the Millennium 
Development Goals and to aid allocation in what is a multi-
billion dollar industry. 

In this paper, we have considered four estimation tech- 
niques for fitting regression models using survey data and 
related them to small area poverty estimation. We have 
shown that although differences in estimates are insufficient 
to invalidate the published national studies, the most 
frequently implemented survey data fitting technique, ELL 
with heteroscedasticity, recommended by the World Bank, 
has some limitations since (like its homoscedastic version) it 
lacks sound theoretical underpinning. Replacing the survey 
fitting part of the ELL method is recommended. For the 
other methodologies considered (the Pseudo-EBLUP, IWEE, 
and the GSR method), all have valid theoretical basis 
mathematically and the results generated can be clearly 
interpreted once the assumptions have been checked. The 
different methodologies when applied to complex weighted 
survey data from the Philippines indicate that for variance 
component estimation from survey data and hence for small 
area estimation at a fine level, Pseudo-EBLUP and partic-
ularly IWEE are likely to be better than the GSR or the ELL
methods, although GSR is sound and easy to use because it 
is available in off-the-shelf software. 
We have also shown that at the level where small area
estimation is actually used for aid allocation, the variance 
estimate of the small area tends to be dominated by the 
cluster level variance rather than by the accuracy of the 
regression parameter estimates. Hence, it is particularly 
important that the cluster-level component of variance (and, 
if fitted as recommended, any small area level variance 
component) is properly estimated. It is also important that 
the regression model used in the generation of small area 
estimates (including choice of suitable regressors) is ap- 
propriate. Essentially, at lower levels of aggregation it is the 
variance components that dominate the standard error of the 
small area estimates, so that the estimation of the variance 
components is critical whatever the choice of aggregation 
level. A sound survey-based regression method, good choice
of regression variables, and care with sample size 
(especially if separate regression models are fitted to subsets 
of survey data), also remain central to sound small area 
estimation of third world poverty. 
Appendix


In footnote 8 of the Elbers et al. (2002) World Bank
working paper and implicitly in Elbers et al. (2003) in
Econometrica, the covariance of the error process is denoted
$\Omega$, and it is stated that $W\Omega^{-1} = P^{T}P$, where $W$ is ‘a
weighting matrix of expansion factors’. In the notation of
Section 4 above, $W$ is block diagonal or diagonal, with
diagonal blocks $W_c$, and $\Omega$ is block diagonal with
diagonal blocks $V_c$.

However, either $W$ and $\Omega$ (or $\Omega^{-1}$) are non-
conformable (with weighting factors in $W$ at cluster level
and the observations and hence $\Omega^{-1}$ at individual level), or,
if conformable, $W\Omega^{-1}$ is generally asymmetric (even if $W$
is diagonal) unless $W$ is a simple multiple of the identity
matrix, i.e., a scalar times $I$.

Hence, $W\Omega^{-1}$ does not equal $P^{T}P$ as has been claimed,
since $P^{T}P$ is symmetric in general and $W\Omega^{-1}$ is not.
Making $W\Omega^{-1}$ symmetric by adding it to its transpose and
dividing by two, as is done in the World Bank PovMap
software, is not a technically adequate solution to this
problem. (Note that even in the simple case where $W$ and
$\Omega^{-1}$ are conformable, and $W$ is diagonal but not all
diagonal elements are equal, $W\Omega^{-1}$ is not symmetric because
it has every element of row $i$ of $\Omega^{-1}$ multiplied by $w_i$
(where $w_i$ is the $i$th diagonal element of $W$) but the $i$th
column does not have every element multiplied by an
identical weight.)
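
A two-by-two numerical sketch makes the asymmetry concrete. The weights and the inverse covariance matrix below are hypothetical values chosen for illustration, and the helper functions are written out in plain Python rather than taken from any package.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def is_symmetric(a, tol=1e-12):
    """True if the square matrix equals its transpose up to a tolerance."""
    return all(abs(a[i][j] - a[j][i]) <= tol
               for i in range(len(a)) for j in range(len(a)))


def symmetrize(a):
    """Average a square matrix with its transpose."""
    return [[(a[i][j] + a[j][i]) / 2 for j in range(len(a))]
            for i in range(len(a))]


w = [[2.0, 0.0], [0.0, 5.0]]              # hypothetical unequal expansion factors
omega_inv = [[1.0, -0.3], [-0.3, 1.0]]    # hypothetical symmetric inverse covariance

product = matmul(w, omega_inv)            # row i of omega_inv is scaled by w_i
averaged = symmetrize(product)            # the averaging device discussed in the text
```

Here the off-diagonal elements of `product` are -0.6 and -1.5, so the product is not symmetric; averaging them gives -1.05 in both positions, which restores symmetry but does not correspond to any particular weighting of the original covariance structure.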

Putting this issue of symmetry to one side, and using
$P^{T}P$ in place of $W\Omega^{-1}$, ELL seem to be claiming that
comparing their ‘sample survey adjusted weighted GLS
estimator’ to the ‘unadjusted GLS’ estimator implies that
instead of using $\Omega^{-1}$ as the underlying metric (i.e., the
inverse of the relevant covariance matrix), a weighted
version, namely $W\Omega^{-1}W^{T}$, should be used. This creates no
asymmetry issue in itself (provided $P^{T}P$ were used in place
of $W\Omega^{-1}$). However, even if $W$ were diagonal and $P^{T}P$
used, the weight matrix $W$ cannot use even unequal
diagonal weights corresponding to the sampled units, i.e.,
$w_i$ say, because the $ij$th element of $\Omega^{-1}$ (unlike the $ij$th
element of $\Omega$) does not correspond to the $i$th and $j$th units
in the sample (or in the population), so it is rather unclear
what $W$ is or how $W$ can be sensibly defined as ‘a
weighting matrix of expansion factors’.

This argument still applies when $V_c$ is replaced by its
estimator $\hat{V}_c$, which uses estimates in place of $\sigma_v^2$ and $\sigma_e^2$.
Small area estimation of the number of firms’
recruits by using multivariate models for count data 


Maria Rosaria Ferrante and Carlo Trivisano


Abstract 


The number of people recruited by firms in Local Labour Market Areas provides an important indicator of the 
reorganisation of the local productive processes. In Italy, this parameter can be estimated using the information collected in 
the Excelsior survey, although it does not provide reliable estimates for the domains of interest. In this paper we propose a 
multivariate small area estimation approach for count data based on the Multivariate Poisson-Log Normal distribution. This 
approach will be used to estimate the number of firm recruits both replacing departing employees and filling new positions. 
In the small area estimation framework, it is customary to assume that sampling variances and covariances are known. 
However, both they and the direct point estimates suffer from instability. Due to the rare nature of the phenomenon we are 
analysing, counts in some domains are equal to zero, and this produces estimates of sampling error covariances equal to 
zero. To account for the extra variability due to the estimated sampling covariance matrix, and to deal with the problem of 
unreasonable estimated variances and covariances in some domains, we propose an “integrated” approach where we jointly 
model the parameters of interest and the sampling error covariance matrices. We suggest a solution based again on the 
Poisson-Log Normal distribution to smooth variances and covariances. The results we obtain are encouraging: the proposed 
small area estimation model shows a better fit when compared to the Multivariate Normal-Normal (MNN) small area 
model, and it allows for a non-negligible increase in efficiency. 


Key Words: Multivariate Poisson-Log Normal distribution; Zero counts; Generalized Variance Function; Hierarchical 


Bayesian models. 


1. Introduction 


The number of people recruited by firms for a certain 
period can be taken as a key indicator of ongoing changes in 
the economic system. To highlight the dynamic of the 
demand for local labour, we consider the number of people 
recruited by firms in Local Labour Market Areas (LLMAs), 
these last grouped according to i) productive specialization,
ii) firms’ size classes and iii) industrial sector. Domains are 
defined by cross-classifying these three variables. In order to 
emphasise the signals of the reorganisation of the productive 
process, we focus on the numbers of “recruits replacing 
employees leaving the firm (substitute recruits — SR)” and 
“recruits filling new positions (new recruits — NR)”. In Italy, 
information about firms’ recruits is collected by the 
Excelsior Survey co-sponsored by the Union of Italian 
Chambers of Commerce (UNIONCAMERE), the Ministry of 
Labour and the European Union. Unfortunately, this survey 
does not provide reliable estimates of firms’ recruits for 
each of these domains due to small domain sample size. As 
a consequence, a small area estimation (SAE) technique has 
to be adopted in order to obtain estimates with an acceptable 
degree of variability. 

In this paper, we propose a SAE approach for the 
estimation of counts. Due to data constraints, we adopt an 
aggregated area-level model. 


Since we aim at estimating SR and NR, we adopt a 
multivariate SAE model that borrows strength not only from 
areas but also from the correlations between the NR and SR 
true values. In order to estimate the median income of 
different sized groups of families, Fay (1987) proposed a 
multivariate regression model in an Empirical Bayes 
context. Multivariate SAE approaches have also been 
developed by Ghosh, Nangia and Kim (1996) and Datta, 
Fay and Ghosh (1991), Datta, Ghosh, Nangia and Natarajan 
(1996) and Datta, Lahiri, Maiti and Lu (1999) for contin- 
uous data in the hierarchical cross-section time series model 
framework. Fabrizi, Ferrante and Pacei (2005, 2008) 
adopted multivariate area level models to estimate a vector 
of continuous poverty parameters. As in the univariate Fay- 
Herriot model (Fay and Herriot 1979), all of the papers 
mentioned above assume the use of small area normal 
sampling and linking models. 

Since the sampling correlations between SR and NR esti- 
mators are mainly negative, we propose a SAE model based 
on the Multivariate Poisson-Log Normal (MPLN) distribu- 
tion. Unlike other multivariate distributions for counts 
proposed in the literature, this particular distribution allows 
for unconstrained (that is, both positive and negative) 
correlations (Aitchison and Ho 1989). 
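
The property that motivates this choice can be illustrated by simulation. The sketch below draws pairs of counts that are conditionally independent Poisson variables whose log-means are jointly normal, which is the bivariate case of the Poisson-Log Normal construction; the function names, parameter values and the simple Poisson generator are assumptions made here for illustration and are not taken from the paper.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson count by Knuth's multiplication method
    (adequate for the moderate means used in this sketch)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_mpln(n, mu, sigma, rho, seed=0):
    """Simulate n pairs of counts from a bivariate Poisson-Log Normal model
    with common log-mean mu, log-scale sigma and log-scale correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        lam1 = math.exp(mu + sigma * z1)   # latent Poisson mean, first count
        lam2 = math.exp(mu + sigma * z2)   # latent Poisson mean, second count
        pairs.append((poisson(rng, lam1), poisson(rng, lam2)))
    return pairs


def correlation(pairs):
    """Sample Pearson correlation of the two counts."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)
```

Calling `correlation(simulate_mpln(4000, 0.5, 0.8, -0.9))` yields a negative correlation between the two counts, while a positive `rho` yields a positive one, so the sign of the dependence is not constrained by the model.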

We also deal with the instability of estimators of sam- 
pling error variances and covariances. An approximately 
unbiased estimate of the variance of direct estimators is 
usually available in SAE. However, in area-level models it
is customary to assume that the sampling variance is known 
and equal to its estimate (Rao 2003; page 76). This 
assumption is commonly stated and largely accepted in the 
case of large samples, whereas both the variance estimator 
and direct point estimators suffer from instability in the case 
of small samples. As a partial solution, sampling variance 
estimates are often smoothed through the generalized 
variance functions (GVF) approach (Wolter 1985). In You, 
Rao and Gambino (2003), sampling variances and covari- 
ances were smoothed over areas and times. In order to 
consider the extra variability associated with the estimated 
sampling variances, Arora and Lahiri (1997) proposed an 
integrated Hierarchical Bayes (HB) smoothing approach for 
continuous data. See You and Chapman (2006), Liu, Lahiri 
and Kalton (2007) and You (2008) for different extensions 
of Arora and Lahiri (1997). 

Due to the rarity of recruits in certain domains, a further 
problem arises that is linked to the instability of sampling 
error variances and covariances estimators. When direct 
estimates of SR or NR (or both) are equal to zero, estimated 
sampling error variances and covariances are also equal to 
zero. Note that observing estimated variances equal to zero 
does not necessarily imply that the estimates have a high 
degree of accuracy. This problem was encountered in 
previous small area estimation problems (e.g., Elazar 2004; 
Chattopadhyay, Lahiri, Larsen and Reimnitz 1999). Chen 
(2001) proposed unit-level hierarchical modeling to handle 
the problem. Moreover, some studies (Cohen 2000) use the 
logarithmic transformation of the mean (or total) direct 
estimates of the count data in order to adopt a linear SAE 
model, simply discarding the estimates equal to zero. 
Although this solution overcomes the “zero variance” 
problem, it also leads to biased estimates and neglects a 
portion of the sample. 

In order to deal with the instability of variance and 
covariance estimators as well as the problem of estimated 
sampling variances equal to zero, we suggest an “integrated” 
approach in the spirit of that proposed by Arora and 
Lahiri (1997), Liu et al. (2007) and You (2008). Within an 
HB framework, we jointly model the parameters of interest 
and the sampling error covariance matrices by adopting a 
smoothing covariance solution based once again on the 
Poisson-Log Normal distribution. 

The layout of this paper is as follows. The data set 
employed is described in section 2, while section 3 presents 
direct domain estimation and its associated sampling error 
variances and covariances. In section 4, we describe the 
multivariate SAE model we propose for estimating counts 
as well as the solution we suggest for overcoming the 
instability of sampling error variances and covariances 
estimators in the presence of zero counts. Section 5 reports 
the results obtained by measuring the performance of the 
adopted SAE model. Details on the Poisson-Log Normal 
distribution are given in the Appendix. 


2. The Excelsior survey 


The Excelsior Survey is one of the most complete Italian 
statistical sources for labour demand data, providing esti- 
mates of the number of people recruited by Italian firms. 
Each year, a stratified simple random sample of about 
100,000 firms with at least one employee is contacted and 
asked about the number of people it plans to hire in the short 
term. The factors used for stratification are the firm’s 
industrial sector and size class. The allocation of the sample 
in the strata satisfies a constraint on the maximum estimated 
standard error corresponding to a 95% significance level 
(Baldi, Bellisai, Fivizzani and Sorrentino 2007). By focus- 
ing on local geographical details, the survey is designed to 
produce reliable estimates for the administrative provinces 
(NUTS3, following the “Nomenclature of Units for 
Territorial Statistics” reported in 
http://europa.eu.int/comm/eurostat/ramon/nuts). This geographical unit, singled out on 
the basis of administrative criteria, does not appear to be the 
best choice when analysing the dynamics of the local labour 
demand. In order to shed some light on the signals of the 
reorganization of the local productive process, a better 
territorial subdivision would be LLMAs (following the 
OECD definition). LLMAs are groups of municipalities 
sharing the same labour market conditions (for the location 
of LLMAs in Italy, see Sforzi 1991). In Italy, following the 
strategy proposed by Sforzi and Lorenzini (2002) and 
adopted by the Italian Statistical Institute (ISTAT), certain 
LLMAs are labelled “industrial districts” (IDs). IDs are 
geographically defined productive systems characterized by 
a dominant specialization. In the 1990s, these were con- 
sidered to be the main stimulus for the growth of the Italian 
economic system (Becattini 1992). 

Estimating the number of substitute and new recruits in 
firms operating within/outside of IDs can help us verify 
whether IDs are still a source of dynamism for the Italian 
economy as a whole. In order to refer to types of ID, we 
group them according to their productive specialization. 
Similarly, LLMAs not labelled as IDs can be classified 
according to their economic vocation (LLMAs can be 
characterized by a specific manufacturing activity, tourist 
area, city, etc.). Moreover, the comparison between ID and 
non-ID firms makes economic sense if the industrial sector 
and size of the firms are also taken into account. Finally, as 
already noted, domains of interest are defined by cross- 
classifying: i) groups of LLMAs obtained according to their 
productive specialization, ii) firm’s industrial sector and iii) 
firm’s size. 

This paper focuses on the manufacturing sector characterising 
the IDs’ economic activity. The analysis is limited 
to two Italian regions containing a large number of IDs, 
namely Tuscany and Emilia-Romagna, and to firms with 
fewer than 100 employees (as censuses are taken for the 
other size classes). The target population consists of 54,089 
firms employing a total of 809,059 people. 


3. Direct estimates 


Table 1 provides details of the categories defining the 
208 domains of interest. Note that the number of domains is 
less than that expected due to the absence of a number of 
domains within the population. The domains are unplanned 
since they are formed by grouping LLMAs contained in the 
same planned stratum. For the sake of simplicity, in the 
following we avoid using the stratum subscript wherever 
possible. 

Let $\theta_{i1}$ and $\theta_{i2}$ be the true number of NR and SR for 
domain $i$ ($i = 1, ..., 208$), respectively. We shall first define 
a direct estimator of $\theta_{ij}$ ($i = 1, ..., 208$; $j = 1, 2$). Let $y_{ijl}$ be 
the response of the $l$-th unit related to the $j$-th variable in the 
$i$-th domain ($l = 1, ..., n_i$, where $n_i$ is the sample size in 
domain $i$). As design based (direct) estimator we use a ratio 
domain estimator defined as 

$\hat{\theta}_{ij} = \frac{\sum_{l=1}^{n_i} y_{ijl}}{n_i / N_i} \cdot \frac{\hat{N}_i}{N_i}$, 

where $N_i$ and $n_i$ are respectively the population size and the 
sampling size referred to domain $i$, and 
$\hat{N}_i = (n_i / n_{t(i)}) N_{t(i)}$, where $N_{t(i)}$ and $n_{t(i)}$ are 
respectively the population size and the sampling size of the 
stratum $t$ containing the domain $i$ (Särndal, Swensson and 
Wretman 1992; page 391). 

Since we are estimating the number of occurrences of 
rare events, in 50 of the 208 domains, direct estimates of NR 
and/or of SR are equal to zero, that is, $\hat{\theta}_{i1} = 0$ and/or 
$\hat{\theta}_{i2} = 0$. Zero point estimates imply that $\hat{V}(\hat{\theta}_{i1}) = 0$ and/or 
$\hat{V}(\hat{\theta}_{i2}) = 0$, where $\hat{V}(\hat{\theta}_{i1})$ and $\hat{V}(\hat{\theta}_{i2})$ are the standard 
design-based variance estimates of $\hat{\theta}_{i1}$ and $\hat{\theta}_{i2}$, respectively. 
This result gives a false impression of high accuracy, 
whereas the exact opposite is more likely to be true in a 
small area context. Moreover, design based estimates of NR 
and/or of SR equal to zero produce $\widehat{COV}(\hat{\theta}_{i1}, \hat{\theta}_{i2}) = 0$, 
where $\widehat{COV}(\hat{\theta}_{i1}, \hat{\theta}_{i2})$ denotes the standard design-based 
estimate of the design-based covariance between $\hat{\theta}_{i1}$ and 
$\hat{\theta}_{i2}$. As a result, covariances also need to be smoothed in a 
multivariate SAE model. 

We hereafter refer to the set of the 50 small areas having 
one or both zero estimated variances and zero covariances 
as the “Zero Count” (ZC) set. The complementary set of 
158 domains, where $\hat{V}(\hat{\theta}_{i1}) > 0$ and $\hat{V}(\hat{\theta}_{i2}) > 0$, is named 
the “Non Zero Count” (NZC) set. 

Considering the data generating process and the nature of 
the outcome variables, we expect mainly negative correlations 
between $\hat{\theta}_{i1}$ and $\hat{\theta}_{i2}$. Briefly, we need a suitable 
distribution for both smoothing covariance matrices and 
modeling small area parameters that allows for an unrestricted 
covariance matrix, that is, for both positive and 
negative correlations. 


Table 1 
Variables defining domains of interest 

LLMAs grouped by productive specialization      Firm size   Industrial sector 
Industrial district                             1-9          1  Food, beverages and tobacco 
  Food, beverages and tobacco                   10-49        2  Textiles and clothing 
  Textiles and clothing                         50-99        3  Paper products, printing and publishing 
  Paper products, printing and publishing       ≥ 100        4  Machinery 
  Machinery                                                  5  Chemicals and basic metals 
  Jewellery, musical instruments, games, etc.                6  Leather and footwear 
  Leather and footwear                                       7  Wood, furniture and household equipment 
  Wood, furniture and household equipment                    8  Jewellery, musical instruments, games, etc. 
LLMAs not defined as district                                9  Builders, contractors 
  Non-specialised manufacturing                             10  Other manufacturing 
  Non-specialized, excluding manufacturing 
  Tourist 
  Cities 

(a) As defined by the 2-digit ATECO 91-ISIC 3 level classification and by Sforzi (1991). 


(b) Defined according to the number of employees. 
(c) Defined in accordance with Istat (1997). 
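
The zero-count problem described in this section can be illustrated with a short Python sketch. For a single variable the ratio domain estimator reduces algebraically to $\hat{N}_i \bar{y}_{ij}$, which is what the first function computes. The variance formula (a simple-random-sampling expression with a finite population correction) and all numerical values are assumptions made only for illustration, since the exact form of the standard design-based variance estimator is not spelled out here.

```python
from statistics import mean, variance

def ratio_domain_estimate(y, n_stratum, N_stratum):
    """Direct estimate N_hat_i * ybar_ij for one domain and one variable.

    y         : sampled responses y_ijl in domain i (list of counts)
    n_stratum : sample size of the stratum containing the domain
    N_stratum : population size of that stratum
    """
    n_i = len(y)
    N_hat = n_i * N_stratum / n_stratum        # estimated domain size
    return N_hat * mean(y), N_hat

def srs_variance_estimate(y, N_hat):
    """Assumed SRS-type variance estimate (illustrative, not the paper's formula)."""
    n_i = len(y)
    s2 = variance(y)                           # sample variance, divisor n_i - 1
    return N_hat ** 2 * (1 - n_i / N_hat) * s2 / n_i

y_some_recruits = [0, 2, 0, 1, 0]              # hypothetical counts of new recruits
y_no_recruits = [0, 0, 0, 0, 0]                # a "zero count" domain
for y in (y_some_recruits, y_no_recruits):
    est, N_hat = ratio_domain_estimate(y, n_stratum=50, N_stratum=500)
    print(est, srs_variance_estimate(y, N_hat))
```

In the second domain both the point estimate and the estimated variance are zero, although five sampled firms clearly cannot establish that no firm in the domain is recruiting.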


4. An integrated multivariate small 
area model for count data 


Multivariate count data can have a non-trivial correlation 
structure. In general, the modeling of this structure signif- 
icantly affects the estimators’ efficiency and the computa- 
tion of correct standard errors. A number of multivariate 
models for count data have been proposed in the literature, 
such as the Multivariate Poisson, Multivariate Negative 
Binomial and Multivariate Poisson-Gamma Mixture models 
(for a review of such models, see Winkelmann 2003). Un- 
fortunately, these distributions are not suitable for modeling 
our data since they are based on the hypothesis that correla- 
tion is the result of an individual factor that does not vary 
across outcomes, thus implying a covariance structure re- 
stricted to non-negative correlations. In the bivariate case, a 
more flexible covariance structure is provided by the Latent 
Poisson Normal distribution (van Ophem 1999); however, 
any extensions to higher dimensional multivariate data ap- 
pear impractical. 

Aitchison and Ho (1989) proposed a d-variate distri- 
bution that allows for an unrestricted covariance structure, 
the Multivariate Poisson-Log Normal distribution (MPLN). 
No closed form exists for this distribution, but it can be 
represented as a simple mixture allowing for parameter 
estimation in an MCMC approach (Chib and Winkelmann 
2001). Details of the MPLN distribution are provided in the 
Appendix. 


4.1 Smoothing sampling covariance matrices 


As previously mentioned, the instability of standard 
errors in SAE is usually dealt with using a GVF approach. 
In this section, we present a GVF model with a regression 
function inspired by the MPLN distribution. 


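Let $y_{il} = [y_{i1l}, y_{i2l}]'$ be the vector of the two outcome 
variables referring to the $l$-th unit in the $i$-th domain. Let 
$y_{il} \mid \lambda_i, \Sigma_i \perp y_{il'} \mid \lambda_i, \Sigma_i$ ($l \neq l'$) and 
$y_{il} \mid \lambda_i, \Sigma_i \sim PLN_2(\lambda_i, \Sigma_i)$, $\forall i$, $\forall l$. 
Under these hypotheses, the moments up to 
the second order can be expressed as follows: 


$E(y_{ijl} \mid \lambda_i, \Sigma_i) = \exp(\lambda_{ij} + \sigma_{i,jj}/2) = \varsigma_{ij}$ 


$V(y_{ijl} \mid \lambda_i, \Sigma_i) = \varsigma_{ij} + \varsigma_{ij}^2 [\exp(\sigma_{i,jj}) - 1]$ 


$COV(y_{ijl}, y_{ihl} \mid \lambda_i, \Sigma_i) = \varsigma_{ij} \varsigma_{ih} [\exp(\sigma_{i,jh}) - 1]$, $j \neq h$ 


where $\sigma_{i,jh}$ denotes the $(j, h)$, $j, h = 1, 2$, element of $\Sigma_i$. 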
To deal with the problem of smoothing covariance 
matrices, Otto and Bell (1995) suggested an approach based 
on a Wishart distributional assumption; specifically, they 
used smoothed estimates in a small area Normal-Normal 
model. In the same spirit, we propose a Bayesian approach 
using the following GVF strategy. Under simple random 
sampling, let us assume that the sampling covariance matrix 
in domain $i$, $C_i$, follows a Wishart distribution with $n_i - 1$ 
degrees of freedom: 


$C_i \mid n_i, \Gamma_i \sim W_2(n_i - 1, \Gamma_i)$ 


where $\Gamma_i = E(C_i \mid n_i, \Gamma_i)$, $i = 1, 2, ..., 158$, and elements $(j, h)$ 
of $C_i$ are $c_{i,jh} = (n_i - 1)^{-1} \sum_{l=1}^{n_i} (y_{ijl} - \bar{y}_{ij})(y_{ihl} - \bar{y}_{ih})$, 
where $\bar{y}_{ij} = n_i^{-1} \sum_{l=1}^{n_i} y_{ijl}$. 

If $\varsigma_{ij}$ parameters are known, then $E(C_i \mid n_i, \Gamma_i)$ only 
depends on elements of the $\Sigma_i$ matrix. We propose to 
estimate $\varsigma_{ij}$ using the design based estimator $\hat{\varsigma}_{ij} = \hat{\theta}_{ij} / N_i$. 
Thus, we can express each element of the $\Gamma_i$ matrix as a 
function of estimates $\hat{\varsigma}_{ij}$ and of the elements of the $\Sigma_i$ 
matrix: 


$\Gamma_{i,11} = \hat{\varsigma}_{i1} + \hat{\varsigma}_{i1}^2 (\exp(\sigma_{i,11}) - 1)$ 
$\Gamma_{i,22} = \hat{\varsigma}_{i2} + \hat{\varsigma}_{i2}^2 (\exp(\sigma_{i,22}) - 1)$ 
$\Gamma_{i,12} = \hat{\varsigma}_{i1} \hat{\varsigma}_{i2} (\exp(\sigma_{i,12}) - 1)$ 


where $\sigma_{i,11} = \delta_{11}' Z_i$, $\sigma_{i,22} = \delta_{22}' Z_i$, $\sigma_{i,12} = \delta_{12}' Z_i$, $Z_i$ being 
a $3 \times 1$ vector of dummy variables identifying the firm’s 
size class in the domain $i$, and 


$\delta_{11} = (\sigma_{1,11}, \sigma_{2,11}, \sigma_{3,11})'$, $\delta_{22} = (\sigma_{1,22}, \sigma_{2,22}, \sigma_{3,22})'$, $\delta_{12} = (\sigma_{1,12}, \sigma_{2,12}, \sigma_{3,12})'$, 


that is, we assume that parameters $\Sigma_i$ are equal for domains 
belonging to the same firm size class. 

We estimate the $\delta_{11}$, $\delta_{22}$, $\delta_{12}$ parameters on NZC data. Since 
we are following a Bayesian approach, prior specifications 
for $\sigma_{k,jj}$ and $\sigma_{k,12}$, $k = 1, 2, 3$, are needed. We use the 
following prior specifications: each $\sigma_{k,jj}$, $j = 1, 2$, is given 
a uniform distribution over a subset of $\mathbb{R}^{+}$ with a 
large but finite length, and $\rho_k \sim U(-1, 1)$, where 
$\sigma_{k,12} = \rho_k \sqrt{\sigma_{k,11} \sigma_{k,22}}$. In section 4.3, we show how these 
estimates can be used to integrate the SAE model with a 
model for sampling error covariance matrices. 


4.2 A Multivariate Normal-Poisson-Log Normal 
small area model 


In this section, we propose a multivariate SAE model 
based on the MPLN distribution in order to jointly estimate 
SR and NR using the NZC set. 

Let $\theta_i = (\theta_{i1}, \theta_{i2})'$ be the vector of the two parameters of 
interest for the $i$-th domain referring to NZC data 
($i = 1, ..., 158$) and let $\hat{\theta}_i$ be the corresponding vector of 
direct estimates. The SAE model consists of two separate 
models. The first model is a sampling model: 


$\hat{\theta}_i \mid \theta_i \sim \text{ind } N_2(\theta_i, \Psi_i)$, $i = 1, ..., 158$. (1) 


As in Lahiri and Rao (1995), we justify the normality 
assumption in (1) using the central limit argument. It is 
standard practice to assume that sampling error covariance 
matrices $\Psi_i$ are known, and a GVF method is generally 
used to estimate $\Psi_i$. Here, as a smoothed estimation of $\Psi_i$ 
we adopt $\hat{\Psi}_i = E(\Gamma_i \mid C_i, n_i) K_i$, where $K_i = N_i (N_{t(i)}/n_{t(i)} - 1)$. 
From this point on we will refer to $\hat{\Psi}_i$ as Smoothed 
Sampling Error Covariance matrix (SMSEC). 

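The second component of the SAE model is a linking 
model that relates $\theta_i$ to area specific auxiliary data: 


$\theta_i \sim \text{ind } PLN_2(\eta_i, \Sigma_i)$, $i = 1, ..., 158$, (2) 


where $\eta_i = \alpha + \gamma Z_i + \beta Z_i x_i$, 
$Z_i$ is a $3 \times 1$ vector of dummy variables identifying the 
firm’s size class in the domain $i$ and $x_i = \log(x_i^{*})$, where 
$x_i^{*}$ is the number of employees in the domain $i$. 
Finally, $\Sigma_i$ is the covariance matrix related to the 
area-specific random effects, and 


$\alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}$, $\gamma = \begin{pmatrix} 0 & \gamma_{12} & \gamma_{13} \\ 0 & \gamma_{22} & \gamma_{23} \end{pmatrix}$, $\beta = \begin{pmatrix} \beta_{11} & \beta_{12} & \beta_{13} \\ \beta_{21} & \beta_{22} & \beta_{23} \end{pmatrix}$. 

From here on, we refer to this small area model as 
“Multivariate Normal-Poisson-Log Normal” (MNPLN). 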
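
As a minimal sketch of how the linear predictor of the linking model (2) is assembled, the Python function below evaluates $\eta_i = \alpha + \gamma Z_i + \beta Z_i x_i$ for one domain. The function name, the coding of the size class as 0, 1 or 2, and all parameter values are hypothetical and serve only to show how the dummy vector $Z_i$ selects one column of $\gamma$ and $\beta$.

```python
import math

def linear_predictor(alpha, gamma, beta, size_class, employees):
    """Return [eta_i1, eta_i2] for one domain.

    alpha       : list of 2 intercepts
    gamma, beta : 2 x 3 nested lists (rows: NR and SR, columns: size classes)
    size_class  : 0, 1 or 2, the position of the 1 in the dummy vector Z_i
    employees   : number of employees in the domain (must be positive)
    """
    z = [1.0 if k == size_class else 0.0 for k in range(3)]
    x = math.log(employees)
    eta = []
    for j in range(2):
        shift = sum(gamma[j][k] * z[k] for k in range(3))   # gamma Z_i
        slope = sum(beta[j][k] * z[k] for k in range(3))    # beta Z_i
        eta.append(alpha[j] + shift + slope * x)
    return eta

# Hypothetical parameter values; the first column of gamma is fixed at zero.
alpha = [1.0, 2.0]
gamma = [[0.0, 0.5, 1.0], [0.0, -0.5, -1.0]]
beta = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(linear_predictor(alpha, gamma, beta, size_class=1, employees=250))
```

Because the first column of $\gamma$ is zero, $\alpha$ plays the role of the intercept for the first size class, while each size class has its own slope on the log number of employees.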

We adopt a fully hierarchical Bayesian approach. In this 
framework, relatively complex (e.g., multivariate) models 
can be implemented easily; in addition, posterior 
distributions can be approximated using MCMC algorithms. 
Computing small area multivariate estimates, and estimates 
of their MSE in particular, can be difficult within a 
frequentist approach. The specification of priors for the 
described model is as follows: 


$\alpha = (\alpha_1, \alpha_2)' \sim N_2(0, a I_2)$, 

$\gamma_k = (\gamma_{1k}, \gamma_{2k})' \sim N_2(0, g_k I_2)$, $k = 2, 3$, 

$\beta_k = (\beta_{1k}, \beta_{2k})' \sim N_2(0, b_k I_2)$, $k = 1, 2, 3$, 

$\Sigma_i^{-1} \sim W(s, I_2)$, 


where $s = 3$ and $a$, $g_k$, $b_k$ are large compared with the 
scale of the data. This is to reflect the lack of prior 
information about model parameters, thus defining diffuse 
but proper specification of priors. The posterior means 
$\hat{\theta}_i^{HB} = E(\theta_i \mid \hat{\theta}_i, \hat{\Psi}_i)$ are taken as estimators of the area 
parameters, while the posterior variance $V(\theta_i \mid \hat{\theta}_i, \hat{\Psi}_i)$ is 
used as a measure of uncertainty. 

For the sake of comparison, we take the standard 
Multivariate Normal-Normal (MNN) model as a bench- 
mark, where the sampling model is defined as in (1) and the 
linking model is defined as follows: 


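$\theta_i \sim \text{ind } N_2(\mu_i, \Sigma_i^{*})$, (3) 


where $\mu_i = \alpha^{*} + \gamma^{*} Z_i + \beta^{*} Z_i x_i$. Parameters $\alpha^{*}$, $\gamma^{*}$, $\beta^{*}$ and 
their prior distributions are defined as $\alpha$, $\gamma$ and $\beta$ in the 
previous model. 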


4.3 An integrated MNPLN small area model 


In order to account for the extra variability due to the 
estimated covariance matrices of sampling errors, as well as 
to overcome the zero variances and covariances problem, 
we suggest a solution in the spirit of that proposed by Arora 
and Lahiri (1997), Liu et al. (2007) and You (2008). We 
integrate the model for sampling error covariance matrices 
of section 4.1 into SAE models (1) and (2). Thus, we here 
refer to the whole set of 208 domains. 

In this context, the small area sampling model is 
formulated as usual, that is, $\hat{\theta}_i \mid \theta_i \sim \text{ind } N_2(\theta_i, \Psi_i)$, 
$i = 1, ..., 208$. Under the hypotheses regarding $y_{il}$ formulated 
in section 4.1, assuming that the $\Sigma_i$s are known and 
assuming that $\theta_{ij} = N_i \varsigma_{ij}$, the elements of the sampling 
error covariance matrix $\Psi_i$ can be expressed as follows: 


$\psi_{i,jj} = K_i [\theta_{ij}/N_i + \theta_{ij}^2/N_i^2 (\exp(\hat{\delta}_{jj}' Z_i) - 1)]$, $j = 1, 2$ (4) 


$\psi_{i,12} = K_i [N_i^{-2} \theta_{i1} \theta_{i2} (\exp(\hat{\delta}_{12}' Z_i) - 1)]$ (5) 


where $\hat{\delta}_{jj}$, $j = 1, 2$, and $\hat{\delta}_{12}$ are posterior means of 
parameters $\delta_{jj}$ and $\delta_{12}$, respectively, computed using the 
model of section 4.1. 

Since the sampling error covariance matrices are 
expressed as a function of the $\theta_i$ parameters, here they can 
be considered Model Based Sampling Error Covariances 
(MBSEC). The posterior means $\hat{\theta}_i^{HB} = E(\theta_i \mid \hat{\theta}_i)$ are taken 
as estimators of the $\theta_i$s, while the posterior variance $V(\theta_i \mid \hat{\theta}_i)$ 
is used as a measure of uncertainty. 

We note that the MNN model cannot be implemented 
following the integrated approach described above. In fact, 
(3) does not ensure the positivity of $\theta_i$ nor, as a result, of 
the diagonal elements of $\Psi_i$. 
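
Equations (4) and (5) can be sketched in Python as follows. The function name, the coding of the firm size class as 0, 1 or 2 (so that $\hat{\delta}' Z_i$ simply picks one element of $\hat{\delta}$) and every numerical value are assumptions made for illustration; only the algebra follows the two equations.

```python
import math

def mbsec_matrix(theta, N_i, N_stratum, n_stratum,
                 delta11, delta22, delta12, size_class):
    """Model-based sampling error covariance matrix of equations (4) and (5).

    theta      : (theta_i1, theta_i2), current values of the area parameters
    delta..    : length-3 sequences, one smoothing parameter per size class
    size_class : 0, 1 or 2, the position of the 1 in the dummy vector Z_i
    """
    K_i = N_i * (N_stratum / n_stratum - 1.0)
    s11 = delta11[size_class]                  # delta_11' Z_i
    s22 = delta22[size_class]                  # delta_22' Z_i
    s12 = delta12[size_class]                  # delta_12' Z_i
    mean1, mean2 = theta[0] / N_i, theta[1] / N_i
    psi11 = K_i * (mean1 + mean1 ** 2 * math.expm1(s11))
    psi22 = K_i * (mean2 + mean2 ** 2 * math.expm1(s22))
    psi12 = K_i * mean1 * mean2 * math.expm1(s12)
    return [[psi11, psi12], [psi12, psi22]]

# Hypothetical inputs: a small domain with very few expected recruits.
psi = mbsec_matrix(theta=(0.5, 0.2), N_i=50, N_stratum=500, n_stratum=50,
                   delta11=(0.4, 0.3, 0.2), delta22=(0.5, 0.4, 0.3),
                   delta12=(-0.2, -0.1, -0.1), size_class=0)
print(psi)
```

Since $\theta_{ij} > 0$ under the Poisson-Log Normal linking model, the diagonal elements stay strictly positive even in domains where the direct estimate is zero, which is the point of the integrated approach.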


5. Data analysis 


In section 5.1, we compare the MNPLN model with the 
benchmark MNN model and their univariate counterparts. 
We assume SMSEC for both models; we thus refer to the 
two strategies as MNPLN-SMSEC and MNN-SMSEC from 
here on. Since these models do not allow us to deal with the 
zero count problem, we refer this analysis to the NZC set. In 
section 5.2, we compare the SAE integrated strategy based 
on the MNPLN model and MBSEC (MNPLN-MBSEC), 
which we presented in Section 4.3, with the strategy based 
on the MNPLN-SMSEC. We limit the analysis to the NZC 
set in order to evaluate the two strategies under the same 
conditions. Finally, in section 5.3 we evaluate the overall 
performance of the proposed SAE model MNPLN-MBSEC 
for the whole data set (NZC+ZC). 

Posterior distributions of parameters were obtained for all 
models, using Monte Carlo integration via the Gibbs sam- 
pling algorithm. We used the MCMC software WinBUGS 
(Spiegelhalter, Thomas, Best and Gilks 1995) to run three 
parallel chains (each with 25,000 runs), the starting point 
being drawn from an over-dispersed distribution. WinBUGS 
codes are available at the URL http://www2.stat.unibo.it/trivisano/. 
The convergence of the Gibbs sampler was 
monitored by visual inspection of the chains’ plots and of 
autocorrelation diagrams, and by means of the potential 
scale reduction factor proposed by Gelman and Rubin 
(1992). Although all models displayed fast convergence, we 
discarded the first 5,000 iterations from each chain. In 
multivariate models, the fairly strong autocorrelation of 
chains is reduced by thinning the chain (1 out of every 3 
values has been considered for posterior summaries). See 
Rao (2003, pages 228-232) for details. 
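
The burn-in, thinning and convergence monitoring steps can be sketched as below. This is a simplified illustration in Python rather than the WinBUGS set-up actually used: the chains are simulated from a standard normal distribution as a stand-in for real sampler output, and the potential scale reduction factor is computed in its basic form, without the degrees-of-freedom correction of Gelman and Rubin (1992).

```python
import random
from statistics import mean, variance

def discard_and_thin(chain, burn_in=5000, thin=3):
    """Drop the burn-in iterations and keep 1 value out of every `thin`."""
    return chain[burn_in::thin]

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin factor for equal-length chains of one scalar parameter."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

rng = random.Random(1)
# Stand-in for three parallel chains of 25,000 iterations each.
chains = [[rng.gauss(0.0, 1.0) for _ in range(25000)] for _ in range(3)]
kept = [discard_and_thin(c) for c in chains]
print(len(kept[0]), potential_scale_reduction(kept))
```

Values of the factor close to 1 are compatible with convergence, whereas values well above 1 indicate that the chains have not yet mixed.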

The performances of the small area models discussed in 
sections 4.2 and 4.3 are compared using various measures. 
In order to choose among competing models, we computed 
the Deviance Information Criterion (DIC). The DIC is a 
model selection criterion according to which a model’s 
performance is evaluated as the sum of a measure of fit (the 
posterior mean of the deviance, $\bar{D}$) and a measure of 
complexity obtained as the difference between $\bar{D}$ and the 
deviance evaluated at the parameters’ posterior mean. In this 
way, a model is preferred if it displays a lower DIC value 
(Spiegelhalter, Best, Carlin and Van der Linde 2002). 
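
The DIC computation just described can be sketched in a few lines of Python. The toy model (independent normal observations with known unit variance and a single mean parameter) and the posterior draws are hypothetical; they only serve to show how the fit and complexity terms are combined.

```python
import math
from statistics import mean

def deviance_normal(y, mu, sigma=1.0):
    """Deviance (-2 log-likelihood) of i.i.d. N(mu, sigma^2) observations."""
    return sum(math.log(2 * math.pi * sigma ** 2) + ((v - mu) / sigma) ** 2
               for v in y)

def dic(y, mu_draws):
    """Return (DIC, p_D) from posterior draws of the mean parameter."""
    d_bar = mean([deviance_normal(y, mu) for mu in mu_draws])  # posterior mean deviance
    d_hat = deviance_normal(y, mean(mu_draws))                 # deviance at posterior mean
    p_d = d_bar - d_hat                                        # complexity term
    return d_bar + p_d, p_d

y = [0.0, 1.0, 2.0]            # hypothetical data
mu_draws = [0.5, 1.5]          # hypothetical posterior draws
print(dic(y, mu_draws))
```

Between two models fitted to the same data, the one with the lower DIC is preferred.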

In order to verify the strength of the multivariate approach 
to SAE, we use as a benchmark the univariate 
versions of models discussed in sections 4.2 and 4.3, 
defined as follows. For all models, we set $\sigma_{i,12} = 0$ in $\Sigma_i$, 
and we assume $\sigma_{i,11} \perp \sigma_{i,22}$, each $\sigma_{i,jj}$ ($j = 1, 2$) having a 
uniform prior on an interval with lower bound 0. 
For SMSEC models, we set $\hat{\Psi}_i = \text{diag}(\hat{\Psi}_i)$, while for 
MBSEC models we set $\hat{\delta}_{12} = 0$ in (5). In addition, a new 
set of estimates for parameters $\delta_{11}$ and $\delta_{22}$ is obtained by 

Table 2 reports the DIC results for the whole set of small 
area models. 


Table 2 
Model comparison using DIC statistic 

Model                     Data set    DIC 
MNN-SMSEC                 NZC         2 MADD. 
  (univariate version)    NZC         2 Asie 
MNPLN-SMSEC               NZC         2,656.9 
  (univariate version)    NZC         2,661.0 
MNPLN-MBSEC               NZC         2,623.6 
  (univariate version)    NZC         2,638.1 
MNPLN-MBSEC               NZC+ZC      3,202.7 
  (univariate version)    NZC+ZC      3,214.3 


All the multivariate models considered perform better in 
terms of DIC than their univariate counterparts (Table 2). In 
addition, for all multivariate models we find that posterior 
credibility intervals of $\rho_i = \sigma_{i,12} / \sqrt{\sigma_{i,11} \sigma_{i,22}}$ do not contain 
zero. We thus focus on multivariate models in the following 
paragraphs. 

We checked the adequacy of the specified multivariate 
models using posterior predictive checks. Simulated values 
of a suitable discrepancy measure are generated from the 
posterior predictive distribution and are then compared with 
the values of the same measure computed from observed 
data. Let $\hat{\theta}_{obs}$ and $\hat{\theta}_{new}$ denote the observed and generated 
data, respectively. The posterior predictive p-value is defined 
as $p = P\{d(\hat{\theta}_{new}, \theta) > d(\hat{\theta}_{obs}, \theta) \mid \hat{\theta}_{obs}\}$. We consider 
a discrepancy measure proposed in Datta et al. (1999), 
which is defined as 


$d(\hat{\theta}, \theta) = \sum_{i} (\hat{\theta}_i - \theta_i)' \Psi_i^{-1} (\hat{\theta}_i - \theta_i)$. (6) 


Computing the p-value is straightforward using the 
MCMC output. Extreme values of the probability $p$ indicate 
a given model’s lack of fit. Following Rao (2003, pages 245- 
246) and You and Rao (2002), we computed two statistics 
that are useful in order to assess model fit at the individual 
domain level. The first statistic, 
$p_{ij} = P(\hat{\theta}_{ij,new} < \hat{\theta}_{ij,obs} \mid \hat{\theta}_{obs})$, 
provides information about the degree of consistent 
overestimation or underestimation of the $\hat{\theta}_{ij,obs}$s. 

The second statistic is defined as 


$d_{ij} = [E(\hat{\theta}_{ij,new} \mid \hat{\theta}_{obs}) - \hat{\theta}_{ij,obs}] / \sqrt{V(\hat{\theta}_{ij,new} \mid \hat{\theta}_{obs})}$, 

where expectation and variance are under the posterior 
predictive distribution. Table 3 summarizes results relative 
to $p$, $p_{ij}$ and $d_{ij}$. 
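
The three checks can be computed from MCMC output along the following lines. This Python sketch assumes that replicated data are generated from the bivariate normal sampling model, one replicate per posterior draw; the data layout, the function names and the numbers in the usage example are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def quad_form(diff, psi):
    """diff' psi^{-1} diff for a symmetric 2 x 2 covariance matrix psi."""
    (a, b), (_, d) = psi
    det = a * d - b * b
    return (d * diff[0] ** 2 - 2 * b * diff[0] * diff[1] + a * diff[1] ** 2) / det

def draw_bivariate_normal(mu, psi, rng):
    """One draw from N_2(mu, psi) using the Cholesky factor of psi."""
    (a, b), (_, d) = psi
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2]

def posterior_predictive_checks(theta_obs, theta_draws, psi, seed=0):
    """theta_obs[i]: direct estimates; theta_draws[t][i]: posterior draws of
    theta_i; psi[i]: sampling covariance matrix. Returns (p, p_ij, d_ij)."""
    rng = random.Random(seed)
    m, T = len(theta_obs), len(theta_draws)
    exceed = 0
    replicated = [[[], []] for _ in range(m)]   # replicated[i][j]: new data draws
    for draw in theta_draws:
        d_new = d_obs = 0.0
        for i in range(m):
            new = draw_bivariate_normal(draw[i], psi[i], rng)
            d_new += quad_form([new[j] - draw[i][j] for j in range(2)], psi[i])
            d_obs += quad_form([theta_obs[i][j] - draw[i][j] for j in range(2)], psi[i])
            for j in range(2):
                replicated[i][j].append(new[j])
        exceed += d_new > d_obs                 # discrepancy (6), new versus observed
    p_value = exceed / T
    p_ij = [[sum(v < theta_obs[i][j] for v in replicated[i][j]) / T
             for j in range(2)] for i in range(m)]
    d_ij = [[(mean(replicated[i][j]) - theta_obs[i][j]) / stdev(replicated[i][j])
             for j in range(2)] for i in range(m)]
    return p_value, p_ij, d_ij

# Hypothetical example with a single domain and 1,000 posterior draws.
rng = random.Random(5)
draws = [[[20.0 + rng.gauss(0, 1), 8.0 + rng.gauss(0, 1)]] for _ in range(1000)]
print(posterior_predictive_checks([[21.0, 7.0]], draws, [[[4.0, -1.0], [-1.0, 2.0]]]))
```

A p-value near 0 or 1, values of $p_{ij}$ near 0 or 1, or values of $|d_{ij}|$ above roughly 2 would all point to lack of fit.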

To further check the consistency of the data, we calculated 
direct and model-based estimates of ${}_A\theta_s$, $s = 1, ..., 10$, 
that is, the total number of NR and SR for the ten 
domains identified by classifying firms only according to 
the industrial sector. Let $w_{is} = 1$ if the number of recruits in 
the domain $i$ refers to the industrial sector $s$ and $w_{is} = 0$ 
otherwise; then 


${}_A\theta_s = \sum_i \theta_i w_{is}$. (7) 


At this level of aggregation, direct estimates can be 
considered accurate. Consequently, given two sets of 
model-based estimates referring to these large domains, we 
prefer the one that agrees with the direct estimates. Domains 
identified by industrial sectors are planned in the Excelsior 
Survey; each industrial sector is stratified according to firm 
size. Therefore, direct estimates ${}_A\hat{\theta}_s$ for each industrial 
sector are calculated using the standard Horvitz-Thompson 
estimator. Aggregated model-based estimates are computed 
based on the MCMC output. For models referring to NZC 
data, we aggregated following (7) at each MCMC step 
$t$, $t = 1, ..., T$, with samples ${}^{t}\theta_i$ and ${}^{t}\tilde{\theta}_i$ generated respectively 
from the posterior distribution of $\theta_i$ for domains 
belonging to the NZC set and from the predictive distribution 
of $\theta_i$ for domains belonging to the ZC set. The HB 
estimator is defined as 

${}_A\hat{\theta}_s^{HB} = T^{-1} \sum_{t=1}^{T} (\sum_{i \in NZC} {}^{t}\theta_i w_{is} + \sum_{i \in ZC} {}^{t}\tilde{\theta}_i w_{is})$. 

Otherwise, for the model on NZC+ZC data, 
we aggregated following (7) MCMC samples from the 
posterior distributions of $\theta_i$. In this case, the HB estimator 
is defined as 

${}_A\hat{\theta}_s^{HB} = T^{-1} \sum_{t=1}^{T} (\sum_{i \in NZC \cup ZC} {}^{t}\theta_i w_{is})$. 

Table 4 reports summaries of ${}_A\hat{\theta}_s$ and ${}_A\hat{\theta}_s^{HB}$. 
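
The aggregation in (7) applied at each MCMC step can be sketched as follows, for one variable at a time. The layout of the draws, the function name and the 95% level of the interval are assumptions made for illustration; the interval is a crude empirical one based on sorted per-step totals.

```python
def hb_sector_estimates(draws, sector, level=0.95):
    """Aggregate domain-level draws into sector totals.

    draws[t][i] : value of theta_i at MCMC step t (posterior draws for NZC
                  domains, predictive draws for ZC domains)
    sector[i]   : industrial sector label of domain i
    Returns {sector: (posterior mean of the total, lower bound, upper bound)}.
    """
    result = {}
    for s in sorted(set(sector)):
        # Equation (7) applied to each MCMC step, then sorted for the quantiles.
        totals = sorted(sum(v for v, lab in zip(step, sector) if lab == s)
                        for step in draws)
        T = len(totals)
        lower = totals[int((1 - level) / 2 * T)]
        upper = totals[min(T - 1, int((1 + level) / 2 * T))]
        result[s] = (sum(totals) / T, lower, upper)
    return result

# Hypothetical draws: 2 MCMC steps, 3 domains, the first two in sector 1.
print(hb_sector_estimates([[1, 2, 3], [3, 4, 5]], sector=[1, 1, 2]))
```

Aggregating within each step before averaging, rather than summing posterior means, keeps the full posterior distribution of each sector total available for interval estimates.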


For all the multivariate models, we examined the following 
variants of the prior distributions: independent non-informative 
flat prior distributions were used for the 
elements of the vectors $\alpha$, $\beta$, $\gamma$ and for the parameters 
defining the matrix $\Sigma_i$ (the variances $\sigma_{i,jj}$, $j = 1, 2$, and the 
correlation $\rho_i$). We did the 
same for the elements of matrix $\Sigma_i^{*}$ in the MNN model. We 
did not find any relevant changes in the posterior distributions 
of parameters of interest. 


5.1 Comparing the MNPLN-SMSEC and MNN- 
SMSEC models on the NZC set 


We find that the MNPLN-SMSEC model largely outperforms 
the MNN-SMSEC one in terms of DIC (Table 2). 
The latter model shows a lack of fit as it displays a p-value 
equal to 0.034 (Table 3), whereas a value of 0.65 suggests 
the adequacy of the MNPLN-SMSEC model. This finding 
is confirmed when the $p_{ij}$ and $d_{ij}$ measures (Table 3) for the 
two models are compared. For the MNN-SMSEC model, 
$p_{ij}$ ranges over domains from 0.000 to 0.995 for NR 
($j = 1$) and from 0.003 to 0.993 for SR ($j = 2$), 
indicating overestimation and underestimation in 
some domains. In addition, summaries of the standardized 
residuals $d_{ij}$ indicate that there are predicted values outside 
two standard deviations of the corresponding observed 
values. The same measures for the MNPLN-SMSEC model 
indicate an adequate fit. 

We also find that the MNPLN-SMSEC model outperforms 
the MNN-SMSEC model when performances are 
evaluated with reference to estimates for large domains 
(Table 4). In fact, credibility intervals for the MNN-SMSEC 
only cover 2 aggregated direct estimates for NR and 4 for 
SR, while credibility intervals under the MNPLN-SMSEC 
cover 6 aggregated direct estimates for NR and 6 for SR. 


Table 3 
Posterior predictive checks; summaries of $p_{ij}$ and $d_{ij}$ calculated with respect to $i$ 

Model          Data set   p                p_i1     p_i2     d_i1     d_i2 
MNN-SMSEC      NZC        0.034   min      0.000    0.003   -3.764   -2.867 
                                  median   0.591    0.616    0.257    0.295 
                                  max      0.995    0.993    2.656    2.515 
MNPLN-SMSEC    NZC        0.65    min      0.154    0.129   -0.965   -1.165 
                                  median   0.535    0.561    0.124    0.149 
                                  max      0.891    0.912    1.216    1.286 
MNPLN-MBSEC    NZC        0.78    min      0.090    0.134   -1.085   -0.983 
                                  median   0.515    0.519   -0.084   -0.085 
                                  max      0.916    0.914    1.401    1.787 
MNPLN-MBSEC    NZC+ZC     0.79    min      0.072    0.111   -1.164   -0.945 
                                  median   0.506    0.523   -0.076   -0.094 
                                  max      0.903    0.913    1.301    1.778 


Table 4 
Direct and HB estimates for industrial sectors; in italic HB estimates whose credibility intervals cover direct estimates 

Each row reports, for sector $s$: the direct estimate and its standard error, followed by the HB estimate and the lower and upper bounds of its credibility interval under MNN-SMSEC (NZC), MNPLN-SMSEC (NZC), MNPLN-MBSEC (NZC) and MNPLN-MBSEC (NZC+ZC), in that order. 

NR ($j = 1$): $s$, ${}_A\hat{\theta}_{s1}$, se(${}_A\hat{\theta}_{s1}$), then ${}_A\hat{\theta}_{s1}^{HB}$ with its interval bounds for each model 
1 1,702.0 41.3 1,077.0 964.3 1,201.0) 0.26610) 1,05520-17509:0 1,649.0 1,434.0 1,906.0 1,630.0 1,406.0 1,899.0 
2 1,758.8 41.9 1936.0) 1-793:0) 21090) W200" 144005 2610” SL 97TOF S663.05 234720) E9050 eS 95/0 2291.0) 
3 725.0 26.9 557.8 460.6 662.7 534.6 435.8 642.3 696.6 5733 842.3 682.8 WA) 811.8 
4 373.9 19.3 202.7 123.0 294.8 192.1 129.1 PPA HAD) 370.0 291.1 471.4 319.8 Dryer 408.3 
5 142.4 11.9 158.2 66.5 258.2 146.0 98.4 205071 23916 164.3 326.9 149.7 108.3 205.0 
6 5,624.1 75.0 4,134.0 3,800.0 4,484.0 5,235.0 4,814.0 5,670.0 5,537.0 5,136.0 5,963.0 5,594.0 5,187.0 6,029.0 
7 887.7 29.8 659.9 549.1 783.7 629.6 526.4 743.4 CW PEM 761.7 1,003.0 844.6 732.3 980.3 
8 223.9 15.0 263.3 188.2 340.6 260.6 182.8 SOS 362.0 262.8 494.1 288.7 203.1 410.8 
9 661.5 Sy H 893.7 790.3 999.4 777.6 624.7 948.7 931.0 754.8 1,150.0 803.3 638.7 1,017.0 
10 1,792.6 42.3 VAGOLOY 1533407" 1598:0) 5790" L380 1798.0 LE847.09 FT, 650:0) 2,074.0 TeS1520 SAEGTO S205 320 

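SR ($j = 2$): $s$, ${}_A\hat{\theta}_{s2}$, se(${}_A\hat{\theta}_{s2}$), then ${}_A\hat{\theta}_{s2}^{HB}$ with its interval bounds for each of the four models 
1 942.7 300.2 482.0 428.5 SyoHlleg) 503.7 413.3 600.4 832.6 706.4 987.6 817.8 686.0 980.0 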
2 920.0 135.7 883.9 798.7 967.4 849.8 694.8 1,022.0 949.8 TFB ONO 922.3 747.6 1,167.0 
3 D532 35.6 249.2 209.2 292.1 2541 202.1 309.9 338.8 269.2 423.1 284.7 226.2 354.5 
4 150.5 36.0 84.4 53.3 120.4 84.7 56.8 119.2 160.6 116.7 218.0 ISTE: 97.0 179.6 
5 39.8 16.6 66.7 Bo 104.2 62.0 Bis} 89.3 116.3 74.3 173.0 60.9 38.4 90.5 
6 2,304.0 Sule) 1°869:0) 469210) 92105420) 2207010" 1.85610) 2282309 2273.0 2060/0" 25080 W229 702 0790 e420 
7 Sa2a/ 105.8 293.0 247.7 345.6 299.0 245.9 3572 471.5 402.8 Spe” 443.3 SMD SNe) 
8 80.8 325 iTS y37/ 85.7 143.5 100.5 67.7 140.3 139.5 76.7 210.4 98.0 58.5 156.9 
9 362.7 66.3 407.0 358.6 453.0 361.0 285.8 438.8 432.1 335.4 552.9 360.4 274.7 476.2 
10 856.3 70.7 661.1 598.1 722.6 714.4 614.0 824.7 855.4 740.5 984.6 832.7 719.8 964.5 


5.2 Comparing the MNPLN-SMSEC and MNPLN- 
MBSEC models on the NZC set 


Values of $p$, $p_{ij}$ and $d_{ij}$ are approximately comparable 
for the MNPLN-SMSEC and MNPLN-MBSEC models 
(Table 3). Likewise, model-based estimates produced by 
MNPLN-SMSEC assume values very close to those 
obtained using MNPLN-MBSEC; in fact, the correlation 
between the posterior means of $\theta_{i1}$ under the two models is 
equal to 0.98, while the same measure referring to $\theta_{i2}$ is 
equal to 0.94. The same results arise for the correlation 
between posterior standard errors, which are 0.92 and 0.94, 
respectively. Performances of the MNPLN-MBSEC model 
in terms of agreement with direct estimates of large domains 
(Table 4) are slightly better than those of the MNPLN- 
SMSEC model: respectively, 7 direct estimates of NR and 8 
of SR are covered by the credibility interval calculated 
under this model. 

Given these results, we conclude that the fit of the 
MNPLN-MBSEC model is adequate. 


5.3 Evaluating the performances of MNPLN- 
MBSEC models on the NZC+ZC set 


We observe that the performances of the MNPLN-MBSEC 
model on the whole dataset in terms of the $p$, $p_{ij}$ and 
$d_{ij}$ measures are satisfactory and comparable with those of 
the same model on the NZC data set (Table 3). Obviously, 
DIC values for the two models cannot be compared as the 
two models are estimated on different data sets. 

As can be seen in Table 4, all the credibility intervals 
calculated using this model cover direct estimates referring 
to large domains; in other words, the agreement of HB 
estimates with direct estimates is very satisfactory. This 
result can be explained by noting that zero counts are more 
probable in small domains, which are characterized by a 
small number of employees (the covariate in all models). 
Therefore, estimating models on NZC data can lead to 
biased estimates of parameter $\beta$. We conclude that integrating 
a sampling covariance model into the MNPLN small 
area model leads to an appreciable increase in the reliability 
of small area estimates. To describe the efficiency gain of 
the HB estimates, we computed on the NZC set the average 
percent CV reduction (You 2008), defined as the average of 
the difference of the direct CV and HB CV (the ratio of the 
square root of the posterior variance and the posterior mean) 
relative to direct CV. The average CV reduction is 23.1% 
for NR and 29.1% for SR. 
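
The average percent CV reduction can be written out as a short Python function. The function name and the numbers in the example are hypothetical; the computation follows the definition given above and requires strictly positive direct estimates, which holds on the NZC set.

```python
def average_cv_reduction(direct_est, direct_se, post_mean, post_var):
    """Average percent reduction of the HB CV relative to the direct CV."""
    reductions = []
    for est, se, pm, pv in zip(direct_est, direct_se, post_mean, post_var):
        cv_direct = se / est                 # CV of the direct estimate
        cv_hb = pv ** 0.5 / pm               # posterior standard deviation / posterior mean
        reductions.append((cv_direct - cv_hb) / cv_direct)
    return 100.0 * sum(reductions) / len(reductions)

# Two hypothetical domains.
print(average_cv_reduction(direct_est=[100.0, 50.0], direct_se=[20.0, 10.0],
                           post_mean=[100.0, 50.0], post_var=[100.0, 64.0]))
```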


Appendix 


The Multivariate Poisson-Log Normal distribution 


Let $y = (y_1, y_2, \ldots, y_j, \ldots, y_d)$ be a d-dimensional vector
of counts, and suppose that $y_j \mid \tau_j \sim \mathrm{Po}(\tau_j)$, with
$y_j \mid \tau_j \perp y_h \mid \tau_h$ $(j \neq h)$. Let the vector of parameters
$\tau = (\tau_1, \tau_2, \ldots, \tau_j, \ldots, \tau_d)$ follow a multivariate Log
Normal, that is, $\tau \mid \lambda, \Sigma \sim \mathrm{LN}_d(\lambda, \Sigma)$, where $\lambda = E(\log \tau)$
and $\Sigma = \mathrm{COV}(\log \tau)$. Then the marginal distribution of $y$
is a Multivariate Poisson-Log Normal (MPLN) distribution,
which is a log normal mixture of $d$ independent $\mathrm{Po}(\tau_j)$,
that is, $y \mid \lambda, \Sigma \sim \mathrm{PLN}_d(\lambda, \Sigma)$. By denoting the
$(j, h)$, $j, h = 1, 2, \ldots, d$ element of $\Sigma$ as $\sigma_{jh}$, marginal
moments can be obtained easily through conditional
expectation results and the standard properties of the
Poisson and Log Normal distributions:


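$$E(y_j \mid \lambda, \Sigma) = \exp(\lambda_j + \sigma_{jj}/2) = \zeta_j$$
$$V(y_j \mid \lambda, \Sigma) = \zeta_j + \zeta_j^2[\exp(\sigma_{jj}) - 1]$$
$$\mathrm{COV}(y_j, y_h \mid \lambda, \Sigma) = \zeta_j \zeta_h[\exp(\sigma_{jh}) - 1], \quad j \neq h.$$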


Note that the MPLN model allows for overdispersion
provided that $\sigma_{jj} > 0$, thus leading to $V(y_j \mid \lambda, \Sigma) >
E(y_j \mid \lambda, \Sigma)$. Moreover, the correlation structure of counts
is unrestricted, since $\mathrm{COV}(y_j, y_h \mid \lambda, \Sigma)$ can be either
positive or negative depending on the sign of $\sigma_{jh}$. Aitchison
and Ho (1989), as well as Good and Pirog-Good (1989), 
studied a bivariate MPLN distribution, albeit exclusively in 
cases without covariates. However, the same model can 
easily be extended to take covariates into consideration 
(Chib and Winkelmann 2001). 
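
The marginal moments of the MPLN distribution given above can be evaluated directly from the location vector and covariance matrix of the log intensities. The following Python sketch is illustrative only: the function name, the variable name `zeta` for the marginal means and the example parameter values are assumptions, not part of the paper.

```python
import math


def mpln_moments(lam, sigma):
    """Marginal means and covariance matrix of an MPLN vector.

    lam   : list of length d, the mean of log(tau)
    sigma : d x d covariance matrix of log(tau), as a list of lists
    """
    d = len(lam)
    # Marginal mean of each count: exp(lam_j + sigma_jj / 2).
    zeta = [math.exp(lam[j] + sigma[j][j] / 2.0) for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for j in range(d):
        for h in range(d):
            # Log Normal mixing part, shared by variances and covariances.
            cov[j][h] = zeta[j] * zeta[h] * (math.exp(sigma[j][h]) - 1.0)
            if j == h:
                cov[j][h] += zeta[j]  # Poisson part of the variance
    return zeta, cov


# Hypothetical bivariate example with a negative off-diagonal element.
mean, cov = mpln_moments([0.0, 1.0], [[0.5, -0.2], [-0.2, 0.3]])
```

In this example each variance exceeds the corresponding mean (overdispersion) and the covariance takes the sign of the off-diagonal element of the covariance matrix of the log intensities.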
Linearization variance estimation for
generalized raking estimators in the presence of nonresponse


Julia D’Arrigo and Chris Skinner


Abstract 


Alternative forms of linearization variance estimators for generalized raking estimators are defined via different choices of 
the weights applied (a) to residuals and (b) to the estimated regression coefficients used in calculating the residuals. Some 
theory is presented for three forms of generalized raking estimator, the classical raking ratio estimator, the ‘maximum 
likelihood’ raking estimator and the generalized regression estimator, and for associated linearization variance estimators. A 
simulation study is undertaken, based upon a labour force survey and an income and expenditure survey. Properties of the 
estimators are assessed with respect to both sampling and nonresponse. The study displays little difference between the 
properties of the alternative raking estimators for a given sampling scheme and nonresponse model. Amongst the variance 
estimators, the approach which weights residuals by the design weight can be severely biased in the presence of 
nonresponse. The approach which weights residuals by the calibrated weight tends to display much less bias. Varying the 
choice of the weights used to construct the regression coefficients has little impact. 


Key Words: Calibration; Nonresponse; Raking; Variance estimation; Weight. 


1. Introduction 


Survey weighting is widely used to adjust for non- 
response bias. Generalized raking estimation (Deville, 
Särndal and Sautory 1993) provides a class of weighting
methods which may be used when population totals of 
auxiliary variables are available. These methods can, in 
principle, remove (large-sample) nonresponse bias when the 
probability of nonresponse is related to the values of the 
auxiliary variables via a generalized linear model. 

This paper presents some theory for linearization variance 
estimation for such methods in the presence of nonresponse. 
It also reports a simulation study of the properties of alter- 
native raking estimators and associated variance estimators 
in settings designed to mimic two European surveys con- 
ducted by national statistical institutes. We consider three 
forms of raking estimator: the classical raking ratio estimator, 
the ‘maximum likelihood’ raking estimator (Brackstone and 
Rao 1979; Fuller 2002) and the generalized regression 
estimator (GREG). The first estimator has been used in 
practice in the British Labour Force Survey (LFS), the first 
survey upon which our simulation study is based. A version 
of the second estimator has been used in practice in the 
German Survey of Income and Expenditure (SIE), the 
second survey upon which our simulation study is based. 
The GREG estimator is widely used in many surveys, in 
particular in the context of nonresponse (Särndal and
Lundström 2005).

A number of weighting methods, which do not fall into 
the class of generalized raking methods considered here, 
have also been proposed. See Särndal and Lundström
(2005) for a historical account and Kott (2006) and Chang
and Kott (2008) for some recent developments where the
auxiliary variables for which population-level information is
available may differ from those variables which are used as 
covariates in the generalized linear model for the probability 
of nonresponse. 

The primary focus of this paper is on variance estimation 
and specifically on linearization methods, for which there 
exist a number of slightly different forms of variance 
estimator in the literature. In our simulation study we shall 
compare the properties of alternative raking estimators and 
associated variance estimators with respect to the effects of 
both sampling and nonresponse. A previous simulation 
study by Stukel, Hidiroglou and Särndal (1996) found little
difference between two forms of linearization estimator with 
respect to sampling. However, there are reasons why non- 
response may lead to greater differences. Conditions for 
unbiasedness of raking estimation methods under non- 
response models vary between estimation methods (e.g., 
Kalton and Maligalig 1991; Kalton and Flores-Cervantes 
2003) and the choice of variance estimator may be more 
important in the presence of nonresponse (e.g., Fuller 2002, 
Section 8). 

The paper is structured as follows. The generalized 
raking estimators are defined in section 2 and, after intro- 
ducing an asymptotic framework, the bias of these esti- 
mators is considered in section 3. Linearization variance 
estimators are defined in section 4. The simulation study is 
presented in section 5, the results are discussed in section 6 
and some concluding remarks are given in section 7. 


2. Generalized raking estimation 


We consider the class of weighted estimators of a
population total $T_y = \sum_U y_i$, which may be expressed as
$\hat{T}_y = \sum_s w_i y_i$, where $y_i$ is the value of a survey variable for
a unit $i$ in a sample $s$ from a population $U$ and $w_i$ is the
survey weight which may depend on the sample but not on
the choice of survey variable. We suppose here that the
sample $s$ consists of the set of respondents remaining after
sampling and possible unit nonresponse. Generalized raking
is a form of weighted estimation which may be employed
when auxiliary population information is available in the
form of a vector $T_x = \sum_U x_i$ of population totals of values
$x_i$ of a vector of auxiliary variables, where $x_i$ is known for
all units in $s$. Following Deville and Särndal (1992), the
weights $w_i$ are said to be calibrated if they satisfy the
calibration equations $\sum_s w_i x_i = T_x$. The vector $T_x$ is
referred to as the vector of calibration totals. The class of
generalized raking weights $w_i$ is obtained by minimising
the objective function:

$$\sum_s d_i G(w_i/d_i), \tag{2.1}$$

subject to the weights $w_i$ being calibrated, where $G(.)$ is a
specified objective function which meets certain criteria (see
Deville et al. 1993) and $d_i$ is an initial weight. We shall
take this to be the design weight, i.e., $d_i = \pi_i^{-1}$, where $\pi_i$ is
the probability that unit $i$ is sampled. Deville and Särndal
(1992) show that (subject to $G(.)$ obeying certain conditions),
the solution of the above constrained optimisation
problem may be expressed as:

$$w_i = d_i F(x_i'\hat{\lambda}), \tag{2.2}$$

where $F(u) = g^{-1}(u)$ denotes the inverse function of
$g(u) = dG(u)/du$ and $\hat{\lambda}$ is the Lagrange multiplier which
solves the calibration equations:

$$\sum_s d_i F(x_i'\hat{\lambda}) x_i = T_x. \tag{2.3}$$


Deville and Särndal (1992) discuss various choices of the
$G(.)$ function and associated $F(.)$ function. We consider
the following three choices:

linear:

$$G_L(u) = (1/2)(u - 1)^2, \quad F_L(u) = 1 + u;$$

multiplicative (raking ratio):

$$G_M(u) = u\log(u) - u + 1, \quad F_M(u) = \exp(u);$$

maximum likelihood raking:

$$G_{ML}(u) = u - 1 - \log(u), \quad F_{ML}(u) = (1 - u)^{-1}.$$
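
As a quick numerical check of the three choices listed above, the following Python sketch encodes each $G$, its derivative $g$ and the corresponding $F$, and verifies that $F$ is the inverse of $g$. The dictionary keys, the function name and the grid of weight ratios are illustrative assumptions.

```python
import math

# For each choice: (G, g = dG/du, F = inverse function of g).
CHOICES = {
    "linear": (
        lambda u: 0.5 * (u - 1.0) ** 2,
        lambda u: u - 1.0,
        lambda v: 1.0 + v,
    ),
    "multiplicative": (
        lambda u: u * math.log(u) - u + 1.0,
        lambda u: math.log(u),
        lambda v: math.exp(v),
    ),
    "ml_raking": (
        lambda u: u - 1.0 - math.log(u),
        lambda u: 1.0 - 1.0 / u,
        lambda v: 1.0 / (1.0 - v),
    ),
}


def max_inverse_error(name, ratios=(0.5, 0.8, 1.0, 1.5, 3.0)):
    """Largest |F(g(u)) - u| over a grid of weight ratios u = w/d."""
    _, g, F = CHOICES[name]
    return max(abs(F(g(u)) - u) for u in ratios)


errors = {name: max_inverse_error(name) for name in CHOICES}
```

Each $G$ is zero at $u = 1$, so leaving the design weights unchanged carries no penalty, and each $F$ equals 1 at zero.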


See also Deville et al. (1993) and Fuller (2009, section 2.9)
regarding the above terminology for these functions. With
the linear choice of $G(.)$, the optimisation problem has a
closed form solution and the generalized raking estimator
becomes $\hat{T}_y = \hat{T}_{y\pi} + (T_x - \hat{T}_{x\pi})'\hat{B}_s$, the generalised regression
estimator (GREG), where $\hat{T}_{y\pi} = \sum_s d_i y_i$, $\hat{T}_{x\pi} = \sum_s d_i x_i$
and

$$\hat{B}_s = \Big[\sum_s d_i x_i x_i'\Big]^{-1} \sum_s d_i x_i y_i. \tag{2.4}$$

With the multiplicative choice of $G(.)$, the calibrated
estimator of $T_y$ is the classical raking ratio estimator
(Brackstone and Rao 1979) when $T_x$ contains the population
counts in the categories of two or more categorical
auxiliary variables. For example, in the context of the
British Labour Force Survey, $x_i$ denotes the vector of
indicator variables of three categorical auxiliary variables:
$x_i = (\delta_{1..i}, \ldots, \delta_{A..i}, \delta_{.1.i}, \ldots, \delta_{.B.i}, \delta_{..1i}, \ldots, \delta_{..Ci})'$, where $\delta_{a..i} = 1$
if unit $i$ is in category $a$ of the first auxiliary variable and 0
otherwise, $\delta_{.b.i} = 1$ if unit $i$ is in category $b$ of the second
auxiliary variable and 0 otherwise and so on. The population
total $T_x$ of this vector thus contains the population counts in
each of the (marginal) categories of each of the three
auxiliary variables. The construction of the weights for
classical raking ratio estimation has traditionally involved
the use of iterative proportional fitting (Brackstone and Rao
1979). Ireland and Kullback (1968) demonstrate that this
method converges to a solution of the above optimisation
problem.
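
Iterative proportional fitting, as mentioned above, can be sketched for the simplest case of two categorical auxiliary variables. The Python code below is a minimal illustration, assuming strictly positive design-weighted cell totals and margins with equal grand totals; the function name and the numbers are hypothetical.

```python
def rake_two_way(table, row_totals, col_totals, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting of a two-way table to known margins.

    table      : design-weighted sample totals by cell (list of rows)
    row_totals : population counts for the first auxiliary variable
    col_totals : population counts for the second auxiliary variable
    """
    cells = [row[:] for row in table]
    for _ in range(max_iter):
        # Scale each row to its population total.
        for a, row in enumerate(cells):
            scale = row_totals[a] / sum(row)
            cells[a] = [v * scale for v in row]
        # Scale each column to its population total.
        for b, target in enumerate(col_totals):
            scale = target / sum(row[b] for row in cells)
            for row in cells:
                row[b] *= scale
        # Columns now match exactly; stop once the rows match as well.
        err = max(abs(sum(row) - t) for row, t in zip(cells, row_totals))
        if err < tol:
            return cells
    raise RuntimeError("iterative proportional fitting did not converge")


# Hypothetical design-weighted counts and population margins.
fitted = rake_two_way([[30.0, 20.0], [10.0, 40.0]], [60.0, 40.0], [45.0, 55.0])
```

The ratio of each fitted cell to the original cell is the common adjustment factor $w_i/d_i$ for units in that cell. It is a product of a row factor and a column factor, which matches the exponential form of $F$ for the multiplicative case.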

The function $G_{ML}(u)$ leads to an alternative ‘maximum
likelihood’ version of raking adjustment, when $x_i$ takes the
same form, denoting indicator variables of categorical
auxiliary variables. In this case, the objective function in
(2.1) may be interpreted as a quantity which is proportional
to minus a log likelihood in the case of simple random
sampling with replacement (Brackstone and Rao 1979;
Fuller 2002).


3. Asymptotic framework and nonresponse bias 


We now consider the asymptotic properties of $\hat{T}_y$ with
respect to both the sampling design and the nonresponse
mechanism. We assume that the latter is such that each unit
in the population responds, if sampled, with probability $q_i$,
where this probability is not dependent on the choice of the
sample and different units respond independently. We consider
an asymptotic framework defined in terms of sequences
of finite populations and associated probability
sampling designs and response mechanisms (Fuller 2009,
section 1.3), with orders of magnitude terms expressed in
terms of $n = \sum_U \pi_i q_i$, the expected number of responding
units, and $N$, the population size. We assume there exist
positive constants $K_1$, $K_2$ and $K_3$ such that $K_1 < nN^{-1}d_i <
K_2$ and $K_3 < q_i$ for all $i$.

We shall suppose that Horvitz-Thompson estimators of
means are consistent for the corresponding finite population
means and that central limit theorems hold (as expressed
formally in the conditions of Theorem 1.3.9 of Fuller 2009).
In particular, we assume that the sequences and the function
$F(.)$ are such that there is a unique solution $\lambda$ of

$$\sum_U q_i F(x_i'\lambda) x_i = T_x, \tag{3.1}$$

with

$$\hat{\lambda} = \lambda + O_p(n^{-1/2}), \tag{3.2}$$

and that

$$\hat{T}_y = \sum_U q_i F(x_i'\lambda) y_i + O_p(Nn^{-1/2}). \tag{3.3}$$

Deville and Särndal (1992) show that $\lambda = 0$ under certain
assumptions (their Result 2). However, their assumptions
apply just to the distribution induced by the sampling design
and include the requirement that $N^{-1}(\hat{T}_{x\pi} - T_x) \to 0$ in
probability, where $\hat{T}_{x\pi} = \sum_s d_i x_i$. In the case of nonresponse,
however, this requirement will often be implausible (cf. Fuller 2002, page 15) and
we do not require that $\lambda$ be the zero vector.

A key assumption which we shall make is: 


Condition C: there exists a vector $\alpha$ such that $F(x_i'\alpha) = q_i^{-1}$.


If condition C holds then $\alpha$ solves (3.1) and so $\lambda = \alpha$. It
follows from (3.3) that $\hat{T}_y$ is consistent for $T_y$ for any
choice of variable $y$ if this condition holds. Thus, we may
view condition C as a sufficient condition for the absence of
(asymptotic) nonresponse bias. This property of Condition
C has been discussed by Fuller, Loughlin and Baker (1994),
Fuller (2009, page 284) and Särndal and Lundström (2005,
Proposition 9.2) for the case when $F$ is linear. Fuller (2002,
page 15), Kott (2006) and Chang and Kott (2008) also
consider estimating response probabilities using general
models of the form $q_i^{-1} = F(x_i'\alpha)$.

To illustrate what might happen if condition C does not
hold, suppose that $x_i$ is just a scalar with $x_i = 1$. Then the
unique solution of (3.1) is $\lambda = g(N/\sum_U q_i)$ and $\mathrm{plim}(\hat{T}_y) =
N(\sum_U q_i y_i)/(\sum_U q_i)$. Hence, the asymptotic nonresponse
bias will only disappear for those survey variables which are
‘uncorrelated’ with the response probabilities $q_i$.
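
The scalar illustration above can be made concrete with a small numerical sketch. The Python code below evaluates the limiting value $N(\sum_U q_i y_i)/(\sum_U q_i)$ for a toy population; the population values, response probabilities and function name are hypothetical.

```python
def limiting_estimate(y, q):
    """N * sum(q_i * y_i) / sum(q_i): the limit of the estimator when x_i = 1."""
    n_pop = len(y)
    return n_pop * sum(qi * yi for qi, yi in zip(q, y)) / sum(q)


# Hypothetical population of six units with a binary survey variable.
y = [1, 1, 0, 0, 1, 0]
true_total = sum(y)

# Response probabilities unrelated to y: no asymptotic bias.
q_constant = [0.8] * 6
bias_constant = limiting_estimate(y, q_constant) - true_total

# Response probabilities higher for units with y = 1: positive bias.
q_related = [0.9, 0.9, 0.6, 0.6, 0.9, 0.6]
bias_related = limiting_estimate(y, q_related) - true_total
```

With constant response probabilities the limit equals the true total, whereas response probabilities that are higher for units with $y_i = 1$ inflate the limit above the true total.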


4. Linearization variance estimation 


We now proceed to consider the asymptotic variance of
$\hat{T}_y$ and its estimation. As in the previous section, the
variance is defined with respect to the joint distribution
induced by both sampling and nonresponse.

Note first that in general (and in particular for $G_M(.)$ and
$G_{ML}(.)$), iteration is needed to solve the calibration equations.
There does exist a literature (see Deville et al. 1993)
which seeks to estimate the variance of $\hat{T}_y$ after a finite
number of iterations. We follow instead the approach of
Deville et al. (1993) and, for example, Binder and Théberge
(1988) by approximating the variance of $\hat{T}_y$ by the variance
of the ‘converged’ estimator, i.e., the hypothetical estimator
arising from an infinite number of iterations, represented by
$\mathrm{var}(\sum_s w_i y_i)$, where the $w_i$ are the ‘converged’ weights
which solve the constrained optimisation problem in
section 2.

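A linearization variance estimator is obtained by
approximating $\mathrm{var}(\sum_s w_i y_i)$ by $\mathrm{var}(\sum_s d_i z_i)$ for a
‘linearized variable’ $z_i$ (Deville 1999). We now seek to
construct this variable using a large sample argument. We
first obtain an expression for $\hat{\lambda} - \lambda$. A Taylor expansion of the
left side of the calibration equations in (2.3) gives

$$\sum_s d_i F(x_i'\hat{\lambda}) x_i = \sum_s d_i F_i x_i + \sum_s d_i f(x_i'\lambda^*) x_i x_i' (\hat{\lambda} - \lambda),$$

where $F_i = F(x_i'\lambda)$, $\lambda^*$ is between $\hat{\lambda}$ and $\lambda$ and $f(u) =
dF(u)/du$ is assumed to exist. Assuming also continuity of
$f(.)$, the existence of $\lim_{N \to \infty} N^{-1}\sum_U q_i f_i x_i x_i'$ and using
(3.2), we have

$$N^{-1}\sum_s d_i F(x_i'\hat{\lambda}) x_i = N^{-1}\sum_s d_i F_i x_i + N^{-1}\sum_s d_i f_i x_i x_i'(\hat{\lambda} - \lambda) + o_p(n^{-1/2}), \tag{4.1}$$

where $f_i = f(x_i'\lambda)$. Then, assuming $\lim_{N \to \infty} N^{-1}\sum_U q_i f_i x_i x_i'$
is non-singular and using (2.3), we obtain

$$\hat{\lambda} - \lambda = \Big[\sum_s d_i f_i x_i x_i'\Big]^{-1}\Big[T_x - \sum_s d_i F_i x_i\Big] + o_p(n^{-1/2}). \tag{4.2}$$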


See Fuller (2009, proof of Theorem 1.3.9) for formal details
of how (4.1) and (4.2) may be derived and the underlying
regularity conditions. Note that to ensure $\lim_{N \to \infty} N^{-1}\sum_U q_i
f_i x_i x_i'$ is non-singular may require dropping redundant
variables from $x_i$ and possibly (as in Deville and Särndal
1992) modifying the estimator for samples with small
probability that result in singularity of this matrix.

A similar argument involving the Taylor expansion of
$w_i$ in (2.2) about $\lambda$ gives:

$$w_i = d_i[F_i + f_i x_i'(\hat{\lambda} - \lambda)] + o_p(Nn^{-3/2}). \tag{4.3}$$

Then, assuming the existence of necessary population
moments so that the remainder term in (4.3) holds uniformly
across $i$ (Fuller 2009, Corollary 2.7.1.1), we have

$$\hat{T}_y = \sum_s w_i y_i = \sum_s d_i[F_i + f_i x_i'(\hat{\lambda} - \lambda)]y_i + o_p(Nn^{-1/2}) \tag{4.4}$$

and hence from (4.2) and (4.4):

$$\hat{T}_y = \sum_s d_i F_i y_i + \tilde{B}\Big[T_x - \sum_s d_i F_i x_i\Big] + o_p(Nn^{-1/2}), \tag{4.5}$$

where

$$\tilde{B} = \Big[\sum_s d_i f_i y_i x_i'\Big]\Big[\sum_s d_i f_i x_i x_i'\Big]^{-1}. \tag{4.6}$$
Note that $F_i = f_i = 1$ under the assumptions of Deville and
Särndal (1992) (since in this case $\lambda = 0$ and it follows from
the assumptions about $G(.)$ that $F(0) = f(0) = 1$). Hence,
under these assumptions, expression (4.5) corresponds to
Result 5 of Deville and Särndal (1992), i.e., the generalized
raking estimator is asymptotically equivalent to the GREG
estimator. Therefore, the asymptotic variance of $\hat{T}_y$ is the
same as that of $\sum_s d_i z_i$, where $z_i$ is the linearized variable:

$$z_i = F_i(y_i - Bx_i), \tag{4.7}$$

and it is assumed that $\tilde{B}$ in (4.6) converges to a finite limit matrix
$B$. An alternative derivation of this expression is given by
Demnati and Rao (2004, section 3.4).

For the purpose of linearization variance estimation, $\hat{T}_y$
is treated as the linear estimator $\sum_s d_i \hat{z}_i$, where

$$\hat{z}_i = \hat{F}_i(y_i - \hat{B}x_i) \tag{4.8}$$

is treated as a fixed variable.
A number of choices of $\hat{F}_i$ and $\hat{B}$ have been discussed
in the literature. Starting with $\hat{F}_i$, the natural choice implied
by the above argument is $\hat{F}_i = F(x_i'\hat{\lambda})$. A simpler choice,
however, would be to take $\hat{F}_i = 1$. Deville and Särndal
(1992) note that, in their classical theory with $\lambda = 0$, these
choices are asymptotically equivalent but they express a
preference for the choice $\hat{F}_i = F(x_i'\hat{\lambda})$. In our setting with
nonresponse and with $\lambda = 0$ not necessarily holding, the
choice $\hat{F}_i = F(x_i'\hat{\lambda})$ seems preferable and this is emphasized by
Fuller (2002, page 15). Note that these two choices imply
that $\sum_s d_i \hat{z}_i$ either takes the form $\sum_s w_i(y_i - \hat{B}x_i)$ when
$\hat{F}_i = F(x_i'\hat{\lambda})$ or $\sum_s d_i(y_i - \hat{B}x_i)$ when $\hat{F}_i = 1$. We shall
therefore refer to these choices as either $w_i$-weighted
residuals or $d_i$-weighted residuals.

Regarding $\hat{B}$, it follows from our argument on the
choices of $\hat{F}_i$ that $f_i$ in (4.2) should be replaced by $\hat{f}_i =
f(x_i'\hat{\lambda})$, giving:

(i) $\hat{B} = [\sum_s d_i \hat{f}_i y_i x_i'][\sum_s d_i \hat{f}_i x_i x_i']^{-1}$, as also proposed
by Demnati and Rao (2004).


Other choices are


(ii) $\hat{B} = \hat{B}_s'$ as in (2.4), as proposed by Deville et al.
(1993).

(iii) $\hat{B} = [\sum_s w_i y_i x_i'][\sum_s w_i x_i x_i']^{-1}$, as proposed by
Deville and Särndal (1992, equation 3.4), which
might be more practical to compute than the choice in (i) for
users of survey data files which include the $w_i$
weights but not the $d_i$ weights.


The extent to which these choices differ depends on the
choice of $G(.)$ function. For the linear case $f(u) = 1$ so
that the estimators in (i) and (ii) are identical. In the case of
classical raking adjustment, $f(u) = F(u) = \exp(u)$ so that
$\hat{f}_i = \hat{F}_i$ and $d_i \hat{f}_i = w_i$ and the estimators (i) and (iii) are
identical. For the ‘maximum likelihood’ raking estimator we
have $F(u) = (1 - u)^{-1}$ and $f(u) = (1 - u)^{-2}$ so that $d_i \hat{f}_i =
w_i^2/d_i$ and the three variance estimators are all distinct.
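
The relationships between choices (i), (ii) and (iii) can be checked numerically in the simplest setting of a single scalar auxiliary variable. The Python sketch below is illustrative only: it assumes a scalar $x_i$, solves the calibration equation (2.3) by Newton iteration, and uses hypothetical data and function names.

```python
import math


def solve_lambda(d, x, t_x, F, f, tol=1e-12, max_iter=100):
    """Newton iteration for sum_s d_i F(x_i * lam) x_i = t_x with scalar x_i."""
    lam = 0.0
    for _ in range(max_iter):
        resid = t_x - sum(di * F(xi * lam) * xi for di, xi in zip(d, x))
        slope = sum(di * f(xi * lam) * xi * xi for di, xi in zip(d, x))
        step = resid / slope
        lam += step
        if abs(step) < tol:
            return lam
    raise RuntimeError("Newton iteration did not converge")


def b_choices(d, x, y, t_x, F, f):
    """Regression coefficients under choices (i), (ii) and (iii)."""
    lam = solve_lambda(d, x, t_x, F, f)
    w = [di * F(xi * lam) for di, xi in zip(d, x)]    # calibrated weights
    df = [di * f(xi * lam) for di, xi in zip(d, x)]   # d_i * f_i weights

    def coef(wts):
        num = sum(a * yi * xi for a, yi, xi in zip(wts, y, x))
        den = sum(a * xi * xi for a, xi in zip(wts, x))
        return num / den

    return {"i": coef(df), "ii": coef(d), "iii": coef(w)}


# Hypothetical sample of four respondents and a calibration total.
d = [10.0, 10.0, 10.0, 10.0]
x = [1.0, 2.0, 1.0, 3.0]
y = [2.0, 5.0, 1.0, 7.0]
t_x = 90.0

linear = b_choices(d, x, y, t_x, lambda u: 1.0 + u, lambda u: 1.0)
raking = b_choices(d, x, y, t_x, math.exp, math.exp)
ml = b_choices(d, x, y, t_x,
               lambda u: 1.0 / (1.0 - u), lambda u: 1.0 / (1.0 - u) ** 2)
```

In this toy example (i) coincides with (ii) in the linear case and with (iii) in the classical raking case, while all three differ for ‘maximum likelihood’ raking, in line with the argument above.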

Having determined the form of $\hat{z}_i$ in (4.8), the linearization
variance estimator for $\hat{T}_y$ is obtained by estimating
the variance of the linear estimator $\sum_s d_i \hat{z}_i$, treating
$d_i$ and $\hat{z}_i$ as fixed. In the case of a stratified multistage
sampling design, assuming “with replacement” sampling of
primary sampling units (PSUs) within strata, a standard
estimator of the variance (e.g., Stukel et al. 1996) is:

$$\hat{V}\Big(\sum_s d_i \hat{z}_i\Big) = \sum_h \frac{m_h}{m_h - 1}\sum_{j=1}^{m_h}(z_{hj} - \bar{z}_h)^2, \tag{4.9}$$

where $z_{hj} = \sum_k d_{hjk}\hat{z}_{hjk}$, $\bar{z}_h = \sum_j z_{hj}/m_h$, $m_h$ is the number
of PSUs selected in stratum $h$ and $\hat{z}_{hjk}$ is the value
of the variable defined in (4.8) for the $k$th individual within
the $j$th selected PSU in stratum $h$. This estimator remains
appropriate in the presence of nonresponse if individual
response in each PSU is independent of response in all other
PSUs and if at least one individual is observed in each
selected PSU (Fuller et al. 1994, page 78).
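
A minimal Python sketch of the with-replacement variance estimator described above follows. It assumes the linearized values have already been computed for each respondent; the record layout `(stratum, psu, d, z_hat)` and the function names are hypothetical.

```python
def psu_totals(records):
    """Weighted PSU totals z_hj = sum_k d_hjk * z_hat_hjk, grouped by stratum.

    records : iterable of (stratum, psu, d, z_hat) tuples, one per respondent
    """
    totals = {}
    for stratum, psu, d, z_hat in records:
        by_psu = totals.setdefault(stratum, {})
        by_psu[psu] = by_psu.get(psu, 0.0) + d * z_hat
    return {h: list(by_psu.values()) for h, by_psu in totals.items()}


def linearization_variance(totals_by_stratum):
    """Sum over strata of m_h/(m_h - 1) * sum_j (z_hj - mean_h)^2."""
    variance = 0.0
    for stratum, z in totals_by_stratum.items():
        m = len(z)
        if m < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        mean = sum(z) / m
        variance += m / (m - 1) * sum((v - mean) ** 2 for v in z)
    return variance


# Hypothetical respondents: (stratum, PSU, design weight, linearized value).
records = [
    ("A", 1, 10.0, 1.0), ("A", 1, 10.0, -0.5), ("A", 2, 10.0, 0.5),
    ("B", 1, 5.0, 2.0), ("B", 2, 5.0, -2.0), ("B", 3, 5.0, 0.0),
]
variance = linearization_variance(psu_totals(records))
```

Only PSU-level totals enter the formula, so within-PSU dependence between respondents does not need to be modelled.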


5. Simulation studies 


In order to compare the performance of the weighted 
estimators and their corresponding variance estimators, two 
simulation studies were undertaken by constructing artificial 
populations using data from the British Labour Force 
Survey (LFS) and the German Sample Survey of Income 
and Expenditure (SIE). In each case, R=1,000 samples 
were generated from these populations by first sampling, in 
a way designed to mimic the real sampling scheme after 
some simplification, and then removing nonresponding 
cases according to two nonresponse models. The first 
assumes multiplicative nonresponse which, from Condition 
C in section 3, might be expected to lead to least bias for the 
raking ratio method. The second model assumed additive 
nonresponse, which might be expected to lead to least bias 
for the GREG estimator. 

For each of the R samples, point estimates of parameters 
were calculated using the different generalized raking 
methods presented in section 2 and variance estimates were 
calculated using the different linearization methods 
presented in section 4. The properties of the estimators were 
then summarised. 
5.1 Study based on the British Labour Force Survey


The first study was based upon data from the March-May 
1998 quarter of the British LFS, a survey of persons living 
in private households in Britain, designed to provide 
information on the British labour market and carried out by 
the Office for National Statistics (ONS). The sample of 
approximately 58,000 households was treated as an artificial 
population. Repeated samples were drawn from this 
population in a way intended to mimic the design used for 
the LFS (ONS 1998, Section 3). Each sample consisted of 
1,211 households selected by stratified simple random 
sampling with proportional allocation across 19 strata, 
defined by region of residence. These regions were designed 
to mimic interviewer areas which defined strata in the LFS. 
In the LFS all individuals in a sampled household are 
interviewed if possible. In this simulation study, all the 
respondents in a sample household were retained, except 
those aged under 16, who are not relevant for the estimates 
of interest. 

The following two nonresponse models, based upon 
results of a study of Foster (1998), were used to determine 
whether sampled individuals responded. 


Multiplicative Nonresponse Model:


$q_i^{-1}$ = 1.15 × 1.17 (if London)
× 1.13 (if aged under 35)
× 1.1 (if female)


Additive Nonresponse Model:


$q_i^{-1}$ = 1.15 + 0.20 (if London)
+ 0.15 (if aged under 35)
+ 0.10 (if female)


where $q_i$ is the response probability defined at the beginning
of section 3 and the form of the model is chosen to
satisfy Condition C.
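
The two nonresponse models can be written as a small Python function that returns $q_i^{-1}$ and a helper that simulates response. This is an illustrative sketch: the function and argument names are hypothetical, and the multiplicative London factor is exposed as a parameter whose default (1.17) is an assumption, since that coefficient is hard to read in this copy of the paper.

```python
import random


def inverse_response_prob(london, under_35, female, model, london_factor=1.17):
    """Return 1/q_i under the multiplicative or additive nonresponse model.

    london_factor is an assumed value for the multiplicative London effect.
    """
    if model == "multiplicative":
        inv_q = 1.15
        if london:
            inv_q *= london_factor
        if under_35:
            inv_q *= 1.13
        if female:
            inv_q *= 1.1
    elif model == "additive":
        inv_q = 1.15
        if london:
            inv_q += 0.20
        if under_35:
            inv_q += 0.15
        if female:
            inv_q += 0.10
    else:
        raise ValueError("model must be 'multiplicative' or 'additive'")
    return inv_q


def responds(rng, london, under_35, female, model):
    """Simulate one response indicator with probability q_i."""
    q = 1.0 / inverse_response_prob(london, under_35, female, model)
    return rng.random() < q


rng = random.Random(2009)  # fixed seed so the simulation is reproducible
example = responds(rng, london=True, under_35=False, female=True, model="additive")
```

Under either model the baseline response probability is $1/1.15$, about 0.87, and every listed characteristic lowers it.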

Three parameters of interest are defined for the artificial 
population: the total number of persons unemployed, em- 
ployed or inactive in the workforce. Weights were con- 
structed for responding individuals, with calibration totals 
consisting of population counts in the categories of three 
categorical auxiliary variables and with Horvitz-Thompson 
initial weights d,, as in section 2. The choice of auxiliary 
variables was designed to mimic those used in the LFS. 
However, because of the reduced scale of our artificial 
population and the consequent smaller numbers of indi- 
viduals within strata, we simplified the LFS calibration 
variables to the following three categorical factors, defining 
83 control totals: 

- area of residence with 23 categories;

- a cross-classification of sex by 10 age groups (consisting
  of single years for those between 16 and 24 and a
  separate age group for 25 or older) with 20 categories;

- a cross-classification of region (Northern England;
  London and South East; Midlands and East Anglia;
  Scotland) by sex by age in 15-year age groups (16-29,
  30-44, 45-59, 60-75 and 75 or older) with 40 categories.


5.2 Study based on the German sample Survey of 
Income and Expenditure 


Our second study is based on the 1998 German Survey of 
Income and Expenditure (SIE), a national household survey 
conducted every 5 years by the Federal Statistical Office, to 
provide information about the economic and social situation 
of households, especially regarding the distribution of 
income and expenditure (Münnich and Schürle 2003). We
used data from a synthetic population of 64,326 households, 
created to represent 20% of all households from the Bremen 
region, excluding those with a monthly household net 
income of DM 35,000 or above (DM denotes the currency of 
German marks). A quota sampling design was employed for 
this survey and we have not attempted to mimic this design. 
Instead, our simulation study employs simple random 
sampling together with nonresponse. Repeated simple 
random samples of 1,340 households were drawn from the 
artificial population, representing a sampling fraction of 
about 1/48. Nonresponse models were constructed using the 
results of studies of similar surveys in Great Britain: the 
Family Expenditure Survey and the National Food Survey 
(Foster 1998). For each selected sample, the subset of 
responding households was determined by the following 
nonresponse models: 


Multiplicative Model:


$q_i^{-1}$ = 1.44 × 1.09 (if self-employed)
× 1.03 (if unemployed)
× 0.97 (if employed)
× 1.16 (if no children in the household).


Additive Model:


$q_i^{-1}$ = 1.44 + 0.13 (if self-employed)
+ 0.04 (if unemployed)
- 0.04 (if employed)
+ 0.23 (if no children in the household).


The parameters of interest are the total household net 
income per quarter and the total household expenditure per 
quarter, computed from the finite artificial population. 

As for the LFS study, each sampled household was assigned
a weight. In the actual SIE the weights are constructed
using essentially the maximum likelihood raking method by
adjusting the sample data simultaneously to the marginal
distributions of several characteristics, such as household
type, social economic status of the reference person, household
net income class and region (land). We try to mimic this
adjustment, as far as possible, in our study. However, as for
the LFS, because of the problem of strata with small numbers
of households we simplify the SIE calibration variables to the
following three categorical factors:


- household type with 7 categories
  - mother/father alone + 1 child,
  - mother/father alone + 2 or more children,
  - couple with 1 child, spouse employed,
  - couple with 1 child, spouse unemployed,
  - couple with 2 or more children, spouse employed,
  - couple with 2 or more children, spouse unemployed,
  - other.

- social status of the reference person with 5 categories
  - self-employed,
  - civil servant or military,
  - employee,
  - worker,
  - unemployed, pensioner, student or other.

- household net income per quarter with 3 categories
  - 0-5,000 DM,
  - 5-7,000 DM,
  - 7-35,000 DM.

6. Results 


6.1 Properties of point estimators 


Table 6.1 presents the properties of the point estimators
of total unemployed in the LFS study for different
calibration methods and alternative assumptions about
nonresponse. The properties are assessed following usual
practice in simulation studies. For example, the bias in
Table 6.1 is obtained from $B(\hat{T}_y) = E(\hat{T}_y) - T_y$, where
$E(\hat{T}_y) = R^{-1}\sum_{r=1}^{R}\hat{T}_y^{(r)}$, $\hat{T}_y^{(r)}$ is the value of $\hat{T}_y$ for sample $r$
and $R$ is the number of simulated samples. We observe
from this table that the standard error remains virtually
constant across alternative raking methods for a given
nonresponse model. Nonresponse leads to an increase in the
standard error across all estimators as expected (since the
sample size is reduced). The table does show evidence of
nonresponse bias, which is of a similar order for each of the
raking methods. We do not find that this bias is least when
the estimator matches the nonresponse model (i.e., the
GREG estimator for additive response and the raking estimator
for multiplicative response) as we might have
expected. Perhaps this is because the covariates used in the
nonresponse models (e.g., the aged 35+ variable) are not all
included in the calibrating variables. Nevertheless, the
nonresponse bias is small in the sense that the root mean
square error is very similar to the standard error in each
case. Under nonresponse, the GREG calibration method
generates some negative weights whereas this is avoided by
the two raking methods, as expected. A greater number of
very large weights are observed, however, for the ‘maximum
likelihood’ raking estimator.

Corresponding results for the SIE data are presented in
Table 6.2. The pattern of results is broadly similar, although
there is now no evidence of significant nonresponse bias
(i.e., the observed bias could be explained by simulation
variation). The standard errors and root mean square errors
also remain virtually constant across weighting methods for
a given nonresponse model.


Table 6.1
Simulation properties of point estimators of total unemployed using data from LFS with R = 1,000


Nonresponse Model/            Bias (simulation    Standard   Root Mean      Number of             Number of Very
Point Estimator               standard error)     Error      Square Error   Negative Weights(1)   Large Weights(2)

Complete Response:
  GREG                          7.6 (14.3)        452.8      452.8          0                     0
  Classical Raking              8.3 (14.3)        452.8      452.9          0                     0
  ‘ML’ Raking                   9.0 (14.3)        453.3      453.4          0                     1
Multiplicative nonresponse:
  GREG                        -45.6 (15.8)        498.3      500.3          4                     1
  Classical Raking            -42.1 (15.8)        498.8      500.6          0                     2
  ‘ML’ Raking                 -39.7 (15.8)        499.4      501.0          0                     7
Additive nonresponse:
  GREG                        -37.3 (15.7)        497.4      498.8          5                     1
  Classical Raking            -34.7 (15.7)        497.5      498.7          0                     3
  ‘ML’ Raking                 -32.4 (15.8)        498.1      499.1          0                     …

(1) the number of such weights across all sample units and all 1,000 samples.
(2) the number of weights more than 10 times the corresponding design weight.
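
The summary measures reported in Table 6.1 can be computed from the simulated estimates along the following lines. This Python sketch is an assumption about "usual practice": it uses divisor $R$ for the simulation variance so that the squared root mean square error equals the squared bias plus the squared standard error exactly, and the function name and toy inputs are hypothetical.

```python
import math


def simulation_summary(estimates, true_total):
    """Bias, its simulation standard error, standard error and RMSE."""
    r = len(estimates)
    mean = sum(estimates) / r
    bias = mean - true_total
    # Simulation variance of the point estimator (divisor R, an assumption).
    variance = sum((e - mean) ** 2 for e in estimates) / r
    mse = sum((e - true_total) ** 2 for e in estimates) / r
    standard_error = math.sqrt(variance)
    return {
        "bias": bias,
        "bias_sim_se": standard_error / math.sqrt(r),
        "standard_error": standard_error,
        "rmse": math.sqrt(mse),
    }


# Toy example with four simulated estimates of a total equal to 100.
summary = simulation_summary([98.0, 102.0, 100.0, 104.0], 100.0)
```

The relation between the columns can be seen in the first row of Table 6.1: a standard error of 452.8 over $R = 1{,}000$ samples gives a simulation standard error of the bias of about $452.8/\sqrt{1000} \approx 14.3$.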
Table 6.2
Simulation properties of point estimators of total income using data from SIE with R = 1,000

Nonresponse Model/            Bias (simulation    Standard    Root Mean      Number of          Number of Very
Point Estimator               standard error)     Error       Square Error   Negative Weights   Large Weights

Complete Response:
  GREG                        -171.2 (331.3)      10,477.3    10,478.7       0                  0
  Classical Raking            -170.6 (331.5)      10,484.1    10,485.8       0                  0
  ‘ML’ Raking                 -169.8 (331.8)      10,491.5    10,492.9       0                  0
Multiplicative nonresponse:
  GREG                        -495.7 (429.7)      13,586.8    13,595.8       0                  0
  Classical Raking            -493.8 (429.6)      13,584.6    13,593.5       0                  0
  ‘ML’ Raking                 -463.5 (429.5)      13,582.8    13,590.7       0                  0
Additive nonresponse:
  GREG                        -473.2 (430.5)      13,614.8    13,623.0       0                  0
  Classical Raking            -469.4 (430.5)      13,612.9    13,621.0       0                  0
  ‘ML’ Raking                 -439.5 (430.5)      13,613.5    13,620.6       0                  0


6.2 Properties of variance estimators 


The properties of the different estimators of the variances
of the point estimators of the total unemployed from the
LFS are shown in Table 6.3 (the ‘standard error
estimate’ in the table refers to the square root of the variance
estimate). We make a number of observations:


- Weighting the residuals by $w_i$ rather than by $d_i$
  reduces the bias and root mean squared error of the
  standard error estimator. The bias arising from the use
  of $d_i$-weighted residuals in the case of nonresponse is
  particularly important (as noted by Fuller 2002) but
  there are also non-negligible reductions of bias even in
  the complete response case.

- The choice of weight used in $\hat{B}$ for the calculation of
  residuals seems to have little impact.

- For a given nonresponse setting and choice of
  weighting the residuals, there is little difference in the
  results for the different choices of point estimator.


The results in Table 6.3 are extended in Table 6.4 to 
consider relative bias of the standard error estimators, rather 
than their absolute bias, and to consider two additional 
parameters: total numbers employed and inactive. We see 
again that the relative bias arising from using d, weighted 


residuals can be substantial in the presence of nonresponse, 
over 20% in several cases, and that this is reduced using the 
w, weighted residuals. Again, little change is observed in 
the percent relative bias of the standard error estimators 
when different choices of weights are used in the calculation 
of B for the residuals. 

Corresponding results for the SIE data when estimating total income
are shown in Table 6.5. Again, the pattern of results is broadly
similar to that for the LFS data in Table 6.3. For the complete
response case, the use of w-weighted residuals rather than d-weighted
residuals leads to modest improvement in bias and RMSE of the
standard error estimators. For the nonresponse cases the improvements
are considerable. Little change in the standard error estimators is
observed when modifying the choice of weight used to compute the
estimated regression coefficients. The results in Table 6.5 are
extended in Table 6.6 to consider relative bias of the standard error
estimators, rather than their absolute bias, and to consider one
additional parameter: total expenditure per quarter. We see again
that the relative bias arising from using d-weighted residuals can be
substantial in the presence of nonresponse, over 35% in all cases,
and that this is reduced using the w-weighted residuals, for
which the relative bias never exceeds about 3%.

Table 6.3
Properties of variance estimators when estimating total unemployed from the LFS (R = 1,000)

Weighting method    w- or d-  Weight      Mean of SE  Bias of SE          RMSE of SE   Coverage² of
                    weighted  used for B  estimator   estimator           estimator    confidence
                    residuals¹ in residual¹           (simulation s.e.)                interval (%)

Complete response:
GREG                d         d           433.9       -18.8 (0.9)         33.4         93.5
                    d         w           434.3       -18.5 (0.9)         [illegible]  [illegible]
                    w         d           442.8       -10.0 (1.0)         [illegible]  93.8
                    w         w           441.9       -10.8 (1.0)         32.0         [illegible]
Classical Raking    d         d           433.9       -18.8 (0.9)         33.4         93.5
                    d         w           434.2       -18.5 (0.9)         [illegible]  [illegible]
                    w         d           443.0       -9.8 (1.0)          32.0         93.8
                    w         w           442.0       -10.7 (1.0)         32.0         93.8
‘ML’ Raking         d         d           433.9       -19.4 (0.9)         33.7         93.5
                    d         w           434.3       -19.1 (0.9)         33.6         [illegible]
                    d         df          435.4       -17.9 (0.9)         33.0         [illegible]
                    w         d           443.7       -9.6 (1.0)          [illegible]  [illegible]
                    w         w           442.3       -11.1 (1.0)         32.4         [illegible]
                    w         df          441.6       -11.8 (1.0)         [illegible]  [illegible]

Multiplicative nonresponse:
GREG                d         d           385.7       -112.6 (0.9)        116.0        85.8
                    d         w           386.1       -112.1 (0.9)        [illegible]  85.8
                    w         d           489.5       -8.8 (1.2)          39.2         94.2
                    w         w           487.8       -10.4 (1.2)         39.2         94.2
Classical Raking    d         d           385.7       -113.1 (0.9)        116.5        85.7
                    d         w           386.1       -112.7 (0.9)        116.1        85.7
                    w         d           490.3       -8.5 (1.2)          39.6         94.3
                    w         w           488.4       -10.4 (1.2)         [illegible]  94.1
‘ML’ Raking         d         d           385.7       -113.7 (0.9)        [illegible]  85.4
                    d         w           386.2       -113.2 (0.9)        116.6        85.6
                    d         df          387.8       -111.6 (0.9)        115.0        85.8
                    w         d           491.9       -7.5 (1.3)          40.4         94.2
                    w         w           488.9       -10.5 (1.2)         39.9         94.0
                    w         df          487.5       [illegible]         39.8         94.0

Additive nonresponse:
GREG                d         d           386.5       -110.9 (0.9)        114.4        86.0
                    d         w           387.0       -110.5 (0.9)        [illegible]  86.0
                    w         d           489.3       -8.2 (1.2)          39.0         94.6
                    w         w           487.6       -9.8 (1.2)          39.0         94.6
Classical Raking    d         d           386.5       -111.0 (0.9)        114.4        85.8
                    d         w           387.0       -110.6 (0.9)        114.0        85.8
                    w         d           490.1       -7.4 (1.2)          39.2         94.7
                    w         w           488.1       -9.4 (1.2)          [illegible]  94.6
‘ML’ Raking         d         d           386.5       -111.6 (0.9)        115.0        85.6
                    d         w           387.0       -111.1 (0.9)        114.6        85.6
                    d         df          388.6       -109.5 (0.9)        113.0        85.9
                    w         d           491.6       -6.5 (1.3)          40.0         94.7
                    w         w           488.6       -9.5 (1.2)          39.5         94.6
                    w         df          487.3       -10.8 (1.2)         39.4         94.6

¹ see text following equation (4.8), where choices df, d and w correspond to B in (i), (ii) and (iii) respectively.
² percentage of 95% normal-theory confidence intervals containing true value.

Table 6.4
Relative bias (%) of standard error estimators of unemployed, employed and inactive totals from LFS (R = 1,000)

Weighting method    w- or d-  Weight      Relative bias of standard error estimator
                    weighted  used for B
                    residuals¹ in residual¹ Unemployed   Employed     Inactive

Complete response:
GREG                d         d           -4.2         -3.4         0.5
                    d         w           -4.1         -3.3         0.6
                    w         d           -2.2         -2.2         1.8
                    w         w           -2.4         -2.3         [illegible]
Classical Raking    d         d           -4.2         -3.3         0.7
                    d         w           -4.1         -3.2         0.8
                    w         d           -2.2         -2.1         [illegible]
                    w         w           -2.4         -2.2         [illegible]
‘ML’ Raking         d         d           -4.3         -3.3         0.7
                    d         w           -4.2         -3.3         0.8
                    d         df          -4.0         -3.1         [illegible]
                    w         d           -2.1         -2.0         [illegible]
                    w         w           -2.4         -2.2         [illegible]
                    w         df          -2.6         -2.3         1.8

Multiplicative nonresponse:
GREG                d         d           -22.6        -22.3        -18.2
                    d         w           -22.5        -22.2        -18.1
                    w         d           -1.8         -3.3         1.8
                    w         w           -2.1         -3.5         [illegible]
Classical Raking    d         d           -22.7        -30.6        -18.4
                    d         w           -22.6        -30.5        -18.3
                    w         d           [illegible]  -13.5        [illegible]
                    w         w           -2.1         -13.7        [illegible]
‘ML’ Raking         d         d           -22.8        -22.0        -18.4
                    d         w           -22.7        -21.9        -18.3
                    d         df          -22.3        -21.7        -17.9
                    w         d           -1.5         -2.7         1.9
                    w         w           -2.1         -3.1         [illegible]
                    w         df          -2.4         -3.3         [illegible]

Additive nonresponse:
GREG                d         d           -22.3        -21.8        -18.5
                    d         w           -22.2        -21.7        -18.4
                    w         d           -1.6         -2.9         [illegible]
                    w         w           -2.0         -3.1         0.8
Classical Raking    d         d           -22.3        -30.2        -18.0
                    d         w           -22.2        -30.1        -17.9
                    w         d           -1.5         -13.3        1.8
                    w         w           -1.9         -13.5        1.4
‘ML’ Raking         d         d           -22.4        -21.6        -18.0
                    d         w           -22.3        -21.5        -17.9
                    d         df          -22.0        -21.3        -17.6
                    w         d           -1.3         -2.4         2.0
                    w         w           -1.9         -2.8         [illegible]
                    w         df          -2.2         -3.0         [illegible]
¹ see text following equation (4.8), where df, d and w correspond to B in (i), (ii) and (iii) respectively.

Table 6.5
Properties of variance estimators when estimating total income from the SIE (R = 1,000)

Weighting method    w- or d-  Weight      Mean of SE   Bias of SE          RMSE of SE   Coverage² of
                    weighted  used for B  estimator    estimator (s.e.)    estimator    confidence
                    residuals¹ in residual¹                                             interval (%)

Complete response:
GREG                d         d           10,338.8     -138.5 (6.9)        259.0        93.8
                    d         w           10,339.2     -138.2 (6.9)        258.8        93.8
                    w         d           [illegible]  -99.5 (6.9)         240.0        94.1
                    w         w           10,376.8     -100.5 (6.9)        240.3        94.1
Classical Raking    d         d           10,338.8     -145.3 (6.9)        262.7        93.8
                    d         w           10,339.2     -144.9 (6.9)        262.5        93.8
                    w         d           10,370.0     -106.1 (6.9)        243.1        94.0
                    w         w           10,376.9     -107.2 (6.9)        243.5        94.0
‘ML’ Raking         d         d           10,338.8     -152.7 (6.9)        266.9        [illegible]
                    d         w           10,339.2     -152.4 (6.9)        266.7        [illegible]
                    d         df          10,340.3     -151.3 (6.9)        266.1        94.0
                    w         d           10,378.3     -113.2 (6.9)        246.5        94.0
                    w         w           10,377.1     -114.4 (6.9)        247.0        94.0
                    w         df          10,376.7     -114.8 (6.9)        247.2        94.0

Multiplicative nonresponse:
GREG                d         d           8,104.7      -5,482.1 (7.4)      5,487.1      75.8
                    d         w           8,105.5      -5,481.3 (7.4)      5,486.3      75.8
                    w         d           13,214.5     -372.3 (12.8)       549.7        94.5
                    w         w           13,210.9     -375.9 (12.8)       [illegible]  94.5
Classical Raking    d         d           8,104.7      -5,479.8 (7.4)      5,484.9      75.8
                    d         w           8,105.5      -5,479.1 (7.4)      5,484.1      75.8
                    w         d           13,214.1     -370.4 (12.8)       549.4        94.5
                    w         w           13,210.4     -374.2 (12.8)       [illegible]  94.5
‘ML’ Raking         d         d           8,104.7      -5,478.1 (7.4)      5,483.1      75.8
                    d         w           8,105.5      -5,477.3 (7.4)      5,482.3      [illegible]
                    d         df          8,108.1      -5,474.7 (7.4)      5,479.7      75.9
                    w         d           [illegible]  -367.6 (12.9)       549.4        94.5
                    w         w           13,210.6     -372.2 (12.9)       [illegible]  94.5
                    w         df          13,208.9     -373.9 (12.9)       392.3        94.5

Additive nonresponse:
GREG                d         d           8,106.3      -5,508.5 (7.4)      [illegible]  75.6
                    d         w           8,107.1      -5,507.7 (7.4)      [illegible]  75.6
                    w         d           13,207.9     -407.0 (12.8)       573.8        94.3
                    w         w           13,204.3     -410.5 (12.8)       [illegible]  94.3
Classical Raking    d         d           8,106.3      -5,506.6 (7.4)      [illegible]  75.7
                    d         w           [illegible]  -5,505.9 (7.4)      5,510.9      [illegible]
                    w         d           [illegible]  -405.3 (12.8)       573.6        94.1
                    w         w           13,203.9     -409.0 (12.8)       575.8        94.1
‘ML’ Raking         d         d           8,106.3      -5,507.2 (7.4)      [illegible]  [illegible]
                    d         w           8,107.1      -5,506.4 (7.4)      5,511.4      75.9
                    d         df          8,109.7      -5,503.8 (7.4)      5,508.8      75.9
                    w         d           13,208.9     -404.6 (12.9)       574.8        94.1
                    w         w           13,204.2     -409.2 (12.9)       [illegible]  94.1
                    w         df          [illegible]  -411.0 (12.9)       578.1        94.1

¹ see text following equation (4.8), where choices df, d and w correspond to B in (i), (ii) and (iii) respectively.
² percentage of 95% normal-theory confidence intervals containing true value.

Table 6.6
Relative bias (%) of variance estimators of expenditure and income totals from SIE (R = 1,000)

Weighting method    w- or d-  Weight      Relative bias of standard error estimator
                    weighted  used for B
                    residuals¹ in residual¹ Expenditure  Income

Complete response:
GREG                d         d           0.7          -1.3
                    d         w           0.7          -1.3
                    w         d           [illegible]  -1.0
                    w         w           [illegible]  -1.0
Classical Raking    d         d           0.7          -1.4
                    d         w           0.7          -1.4
                    w         d           1.2          -1.0
                    w         w           [illegible]  -1.0
‘ML’ Raking         d         d           0.6          -1.5
                    d         w           0.6          -1.5
                    d         df          0.6          -1.4
                    w         d           [illegible]  -1.1
                    w         w           [illegible]  -1.1
                    w         df          [illegible]  -1.1

Multiplicative nonresponse:
GREG                d         d           -38.2        -40.4
                    d         w           -38.2        -40.3
                    w         d           -0.3         -2.7
                    w         w           -0.3         -2.8
Classical Raking    d         d           -38.2        -40.3
                    d         w           -38.2        -40.3
                    w         d           -0.3         -2.7
                    w         w           -0.3         -2.8
‘ML’ Raking         d         d           -38.2        -40.3
                    d         w           -38.2        -40.3
                    d         df          -38.2        -40.3
                    w         d           -0.3         -2.7
                    w         w           -0.3         -2.7
                    w         df          -0.4         -2.8

Additive nonresponse:
GREG                d         d           -38.1        -40.5
                    d         w           -38.1        -40.5
                    w         d           -0.2         -3.0
                    w         w           -0.2         -3.0
Classical Raking    d         d           -38.1        -40.5
                    d         w           -38.1        -40.5
                    w         d           -0.2         -3.0
                    w         w           -0.2         -3.0
‘ML’ Raking         d         d           -38.2        -40.5
                    d         w           -38.2        -40.5
                    d         df          -38.1        -40.4
                    w         d           -0.2         -3.0
                    w         w           -0.3         -3.0
                    w         df          -0.3         -3.0
¹ see text following equation (4.8), where df, d and w correspond to B in (i), (ii) and (iii) respectively.

7. Conclusions


The simulation study showed little difference between 
the bias or variance properties of the three calibration 
estimators considered: the GREG estimator, the classical 
raking estimator and the maximum likelihood raking 
estimator. Some small differences in the distribution of 
extreme weights were observed: the maximum likelihood 
raking estimator had the most very large weights and the 
GREG estimator was the only one with a few negative 
weights. 

Amongst the variance estimators, the main finding was 
the contrast between the approach which weights residuals 
by the design weight and that which weights them by the 
calibrated weight. It was found that the latter variance 
estimator always had smaller bias and that this effect was 
very marked in the presence of nonresponse, when the 
former estimator could be severely biased. The bias of the 
latter estimator was generally small and the coverage level 
of the associated confidence intervals was generally close to 
the nominal coverage. 

Alternative ways of weighting the observations in 
constructing the regression coefficients, when calculating 
the residuals in the linearization variance estimator, were 
considered but little effect was observed and there was no 
evidence that this choice is important in practice. 

In general, the findings for the categorical variables in the
British Labour Force Survey were remarkably similar to the
findings for the continuous variables in the German Income
and Expenditure Survey.

Linearization variance estimators
for model parameters from complex survey data


Abdellatif Demnati and J.N.K. Rao


Abstract 


Taylor linearization methods are often used to obtain variance estimators for calibration estimators of totals and nonlinear 
finite population (or census) parameters, such as ratios, regression and correlation coefficients, which can be expressed as 
smooth functions of totals. Taylor linearization is generally applicable to any sampling design, but it can lead to multiple 
variance estimators that are asymptotically design unbiased under repeated sampling. The choice among the variance 
estimators requires other considerations such as (i) approximate unbiasedness for the model variance of the estimator under 
an assumed model, and (ii) validity under a conditional repeated sampling framework. Demnati and Rao (2004) proposed a 
unified approach to deriving Taylor linearization variance estimators that leads directly to a unique variance estimator that 
satisfies the above considerations for general designs. When analyzing survey data, finite populations are often assumed to 
be generated from super-population models, and analytical inferences on model parameters are of interest. If the sampling 
fractions are small, then the sampling variance captures almost the entire variation generated by the design and model 
random processes. However, when the sampling fractions are not negligible, the model variance should be taken into 
account in order to construct valid inferences on model parameters under the combined process of generating the finite 
population from the assumed super-population model and the selection of the sample according to the specified sampling 
design. In this paper, we obtain an estimator of the total variance, using the Demnati-Rao approach, when the characteristics 
of interest are assumed to be random variables generated from a super-population model. We illustrate the method using 
ratio estimators and estimators defined as solutions to calibration weighted estimating equations. Simulation results on the
performance of the proposed variance estimator for model parameters are also presented.


Key Words: Calibration; Ratio estimators; Total variance; Logistic regression; Weighted estimating equations. 


1. Introduction 


In survey sampling, estimation of a finite population total
$Y = \sum_{k=1}^{N} y_k = Y(y)$ is often of interest, where $N$ is the
size of the finite population. For a general sampling design with
positive inclusion probabilities $\pi_k$, a customary design unbiased
estimator of the total $Y$ is given by
$\hat{Y} = \sum_{k \in s} y_k/\pi_k = \sum_{k=1}^{N} d_k(s) y_k$, where
$s$ is a sample, $d_k(s) = a_k(s)/\pi_k$ are the design weights with
$a_k(s) = 1$ if $k \in s$ and $a_k(s) = 0$ otherwise. We use operator
notation and write $\hat{Y}(z) = \sum_{k=1}^{N} d_k(s) z_k$ so that
$\hat{Y} = \hat{Y}(y)$. Henceforth, all the sums are considered on the
whole population and hence we write $\sum_{k=1}^{N} y_k = \sum y_k$ and
$\hat{Y}(z) = \sum d_k(s) z_k$, to simplify the notation. Again, using
the operator notation, we denote an unbiased estimator of the variance
of $\hat{Y}(z)$ as a quadratic function, $\vartheta(z)$, in the
$z_k$'s.

More complex estimators of a total $Y$ based on known population
auxiliary information, such as ratio and regression estimators, and
estimators of more complex parameters obtained as solutions to sample
weighted estimating equations, such as estimators of “census”
logistic regression coefficients, are also often used in practice.
Estimators that can be expressed as a general functional $T(\hat{M})$
have also been studied, where $\hat{M}$ denotes a measure that
allocates the weight $d_k(s)$ to $y_k$; for example,
$T(\hat{M}) = \int x \, d\hat{M}(x) = \sum d_k(s) y_k$ if the
population parameter is the total $T(M) = \int x \, dM(x) = Y$, where
the measure $M$ allocates a unit mass to each $y_k$ (Deville 1999).
Large-sample estimation of the variance of such complex estimators,
$\hat\theta$ say, has received considerable attention in the
literature. In particular, Taylor linearization methods of estimating
the variance of $\hat\theta$ are generally applicable to any sampling
design that permits an unbiased variance estimator $\vartheta(z)$ of
$\hat{Y}(z)$. Binder (1983) studied estimators $\hat\theta$ that are
solutions to weighted estimating equations and applied Taylor
linearization to obtain a variance estimator that can be expressed as
$\vartheta(\tilde{z})$, where the linearized variable $\tilde{z}_k$
depends on unknown parameters, and $\tilde{z}_k$ is replaced by an
estimator $z_k$ that may be based on the substitution method. Deville
(1999) derived a Taylor linearization variance estimator of the
functional $T(\hat{M})$ as $\vartheta(\tilde{z})$, where
$\tilde{z}_k = I_T(M; y_k)$ denotes the influence function of $T$ at
$y_k$, and then replaced $\tilde{z}_k$ by the sample estimator
$z_{1k} = I_T(\hat{M}; y_k)$. For example, when $\hat\theta$ is the
ratio estimator $(\hat{Y}/\hat{X}) X = \hat{R} X$ of the total $Y$,
where $\hat{X} = \hat{Y}(x)$ and $X = Y(x)$ is the known total of an
auxiliary variable $x$, we get $\tilde{z}_k = y_k - R x_k$ and
$z_{1k} = y_k - \hat{R} x_k$. However,
$z_k = (X/\hat{X})(y_k - \hat{R} x_k)$ is also a candidate to estimate
$\tilde{z}_k$ and the resulting $\vartheta(z)$ is often preferred over
$\vartheta(z_1)$; see Demnati and Rao (2004). Thus the choice of an
estimator of $\tilde{z}_k$ is somewhat arbitrary under Deville’s
approach.

Demnati and Rao (2004) studied general estimators that can be
expressed as smooth functions of the weights
$d(s) = \{d_1(s), \ldots, d_N(s)\}^T$, say $\hat\theta = f(d(s))$, and
obtained a Taylor linearization variance estimator directly as
$\vartheta(z)$ with known linearized variables
$z_k = \partial f(b)/\partial b_k |_{b = d(s)}$ without estimating
$\tilde{z}_k$ first and then replacing it by an estimator. For
example, in the case of the ratio estimator their method automatically
leads to $z_k$ given above. This method can be applied to a variety of
estimators including estimators of “census” logistic regression
parameters based on calibration weights (Demnati and Rao 2004).
Previous work on direct variance estimation includes Binder (1996).

When analyzing survey data, the population values $y_k$,
$k = 1, \ldots, N$, are often assumed to be generated from a
super-population model, and the user is often interested in making
inferences on the model parameters. Let $\theta_N$ be a “census”
parameter, i.e., an estimator of a model parameter $\theta$ when the
population $y_k$-values are all known, and let $\hat\theta$ be a
design-unbiased estimator of $\theta_N$, the “census” parameter.
Suppose that $\hat\theta$ is design-model unbiased for $\theta$, i.e.,
$E_m E_p(\hat\theta) = \theta$, where $E_p$ and $E_m$ respectively
denote the expectations with respect to the design and the model. Then
the total variance of $\hat\theta$ is
$V(\hat\theta) = E_m E_p(\hat\theta - \theta)^2$ which can be
decomposed as

$$V(\hat\theta) = E_m V_p(\hat\theta) + V_m(\theta_N), \qquad (1.1)$$

where $V_p(\hat\theta) = E_p(\hat\theta - \theta_N)^2$ is the design
variance of $\hat\theta$ and $V_m(\theta_N)$ is the model variance of
$\theta_N$. It follows from (1.1) that the total variance may be
estimated using a design-based estimator of $V_p(\hat\theta)$ if the
last term, $V_m(\theta_N)$, is negligible relative to
$E_m V_p(\hat\theta)$. In that case, the distinction between
$\theta_N$ and $\theta$ can be ignored (Skinner, Holt and Smith 1989,
page 14). On the other hand, it is necessary to estimate the total
variance $V(\hat\theta)$ when the model variance $V_m(\theta_N)$ is
not negligible relative to $E_m V_p(\hat\theta)$. This requires
consideration of the joint design and model random processes. Molina,
Smith and Sugden (2001) argued that the combined process of generation
of the finite population and selection of the sample should be the
basis for analytical inferences on model parameters. Rubin-Bleuer and
Schiopu-Kratina (2005) have provided a mathematical framework for
joint model and design-based inference. However, a broadly applicable
method is needed for the estimation of total variance. The main
purpose of this paper is to provide such a method, by extending the
Demnati-Rao approach for finite population parameters.
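
The decomposition (1.1) follows from a standard add-and-subtract argument. The short derivation below is a sketch under the assumptions stated in the text (design-unbiasedness of the estimator for the census parameter and design-model unbiasedness for the model parameter); it is added for the reader's convenience and is not reproduced from the source.

```latex
\begin{aligned}
V(\hat\theta)
  &= E_m E_p\{(\hat\theta - \theta_N) + (\theta_N - \theta)\}^2 \\
  &= E_m E_p(\hat\theta - \theta_N)^2
     + 2\,E_m\{(\theta_N - \theta)\,E_p(\hat\theta - \theta_N)\}
     + E_m(\theta_N - \theta)^2 \\
  &= E_m V_p(\hat\theta) + V_m(\theta_N),
\end{aligned}
```

The cross term vanishes because $E_p(\hat\theta) = \theta_N$, and the last term is the model variance of $\theta_N$ because $E_m(\theta_N) = E_m E_p(\hat\theta) = \theta$.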

In Section 2, we consider the case of a scalar parameter $\theta$ and
present linearization variance estimators by expanding the Demnati and
Rao (2004) approach. The method is illustrated for the special case of
a ratio estimator of a super-population mean $\theta$. Results of
Section 2 are extended in Section 3 to estimators of a vector
parameter $\boldsymbol\theta$ obtained as solutions to weighted
estimating equations, and the method is illustrated for the special
case of parameters of a logistic regression model. Simulation results
are also presented.


2. Scalar model parameter 


2.1 Point estimators 


Consider a finite population $U$ of $N$ elements, and let
$d_k(s) = a_k(s)/\pi_k$ be the design weights attached to the
population element $k$, where $a_k(s) = 1$ if element $k$ is in the
sample $s$ and $a_k(s) = 0$ otherwise, and $\pi_k$ is the inclusion
probability associated with $k$. We consider estimators $\hat\theta$
of a scalar parameter $\theta$ that can be expressed as functions of
random variables under the design and the assumed model. In
particular, $\hat\theta = f(A_d)$, where $A_d$ is a $(p+1) \times N$
matrix with columns
$\mathbf{d}_k = (d_k h_{1k}, d_k h_{2k}, \ldots, d_k h_{(p+1)k})^T = (d_{1k}, \ldots, d_{(p+1)k})^T$,
where $d_k = d_k(s)$ is random under the design, $h_{1k} = 1$, and
$h_{ik}$ $(i = 2, \ldots, p+1)$ are random under the model.

For example, consider the ratio model with fixed covariates $x_k$:

$$E_m(y_k) = \beta x_k, \quad V_m(y_k) = \sigma^2 x_k, \quad \text{Cov}_m(y_k, y_t) = 0, \; k \neq t; \quad k, t = 1, \ldots, N, \qquad (2.1)$$

where $E_m$, $V_m$, and $\text{Cov}_m$ denote model expectation, model
variance, and model covariance respectively and $\sigma^2 > 0$.
Suppose that we are interested in estimating the super-population
mean $\theta = E_m(\bar{Y}) = N^{-1}\sum E_m(y_k) = \beta\bar{X}$ where
$\bar{Y}$ is the finite population mean of $y$. In this case, a ratio
estimator of $\theta$ is given by

$$\hat\theta = \bar{X}(\hat{Y}/\hat{X}) = \bar{X}\hat{R}, \qquad (2.2)$$

where $\hat{Y} = \sum d_k(s) y_k$ and $\hat{X} = \sum d_k(s) x_k$ are
the design-unbiased estimators of the totals $Y$ and $X$, and
$\bar{X}$ is the known population mean of $x$. We can write the ratio
estimator (2.2) in the form
$\hat\theta = \bar{X}(\sum d_{2k})/\sum d_{1k} x_k$, where
$d_{1k} = d_k(s)$ and $d_{2k} = d_k(s) y_k$. This is a special case of
$f(A_d)$ with $p = 1$ and $h_{2k} = y_k$.
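
The representation of the ratio estimator as a function of the two weighted columns can be sketched in a few lines. The function name `f_ratio` and the toy numbers are hypothetical and serve only as an illustration; non-sampled units simply carry a zero weight.

```python
def f_ratio(b1, b2, x, xbar_pop):
    """Ratio-type function f(A_b) = Xbar * sum(b_2k) / sum(b_1k * x_k).

    b1, b2: the two rows of the (p + 1) x N matrix A_b (here p = 1).
    x: fixed covariate values for all N population units.
    xbar_pop: known population mean of x.
    """
    return xbar_pop * sum(b2) / sum(u * v for u, v in zip(b1, x))


# Toy population of N = 4 units; units 0 and 2 are sampled with weight 2.
x = [10.0, 20.0, 30.0, 40.0]
y = [21.0, 39.0, 61.0, 80.0]
d = [2.0, 0.0, 2.0, 0.0]                    # d_k(s) = a_k(s) / pi_k
d1 = d                                       # d_1k = d_k(s)
d2 = [dk * yk for dk, yk in zip(d, y)]       # d_2k = d_k(s) * y_k
theta_hat = f_ratio(d1, d2, x, xbar_pop=25.0)

# Evaluating the same function at model means (1 and beta * x_k)
# returns beta * Xbar, here with beta = 2.
theta_at_means = f_ratio([1.0] * 4, [2.0 * xk for xk in x], x, xbar_pop=25.0)
```

Writing the estimator as a function of arbitrary columns, rather than of the weights actually observed, is what allows it to be differentiated with respect to each column later on.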

Let $E_p$ be the design expectation and $E = E_m E_p$ be the total
expectation. Then, we have $E(d_{1k}) = E_m(1) = 1 = \mu_{1k}$, and
$E(d_{ik}) = E_m(h_{ik}) = \mu_{ik}$, $i = 2, \ldots, p+1$, noting that
$E_p(d_k(s)) = 1$. We assume that $f(A_\mu) = \theta$ where $A_\mu$ is
a $(p+1) \times N$ matrix with columns
$\boldsymbol\mu_k = (\mu_{1k}, \mu_{2k}, \ldots, \mu_{(p+1)k})^T$.
Hence, $\hat\theta$ is asymptotically $pm$-unbiased for $\theta$. In
the special case of the ratio estimator, we have
$f(A_\mu) = \beta\bar{X} = \theta$, noting that $\mu_{1k} = 1$ and
$\mu_{2k} = \beta x_k$.

2.2 Linearization variance estimator


We first derive an estimator of the total variance of a linear
estimator $\hat{U} = \sum \mathbf{u}_k^T \mathbf{d}_k$, where
$\mathbf{u}_k$ is a vector of constants. The total variance of
$\hat{U}$ may be decomposed as

$$V(\hat{U}) = E_m V_p(\hat{U}) + V_m E_p(\hat{U}) = I + II, \qquad (2.3)$$

where $V_p$ and $V_m$ denote design variance and model variance
respectively. A design-unbiased estimator of the component $I$ of the
total variance (2.3) is obtained by estimating the design variance
$V_p(\hat{U})$ for fixed
$\mathbf{h}_k = (h_{1k}, \ldots, h_{(p+1)k})^T$. Now, noting that
$\hat{U} = \sum b_k d_k(s)$ is the standard Narain-Horvitz-Thompson
(NHT) estimator of the total $U = \sum b_k$ when
$b_k = \mathbf{u}_k^T \mathbf{h}_k$ are fixed conditionally, we can
use either the Sen-Yates-Grundy (SYG) variance estimator for fixed
sample size designs or the Horvitz-Thompson (HT) variance estimator
for arbitrary designs. The SYG estimator is given by

$$\text{est}(I) = \vartheta_{SYG}(\hat{U}) = \sum\sum_{k<t} d_{kt}(s)\,(\pi_k\pi_t - \pi_{kt}) \left(\frac{b_k}{\pi_k} - \frac{b_t}{\pi_t}\right)^2, \qquad (2.4)$$

where $d_{kt}(s) = \{a_k(s) a_t(s)\}/\pi_{kt}$ and $\pi_{kt}$ is the
inclusion probability for units $k$ and $t$ $(k \neq t)$. The HT
variance estimator is given by

$$\text{est}(I) = \vartheta_{HT}(\hat{U}) = \sum\sum d_{kt}(s)\,\frac{\pi_{kt} - \pi_k\pi_t}{\pi_k\pi_t}\, b_k b_t, \qquad (2.5)$$

where $d_{kk}(s) = d_k(s)$. For the special case of stratified
random sampling (2.4) and (2.5) are identical.
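
The two design-variance estimators can be compared numerically. The sketch below uses the textbook forms of the Sen-Yates-Grundy and Horvitz-Thompson estimators written over sampled units only (so that $d_{kt}(s) = 1/\pi_{kt}$); the function names and the `pi_joint` callback are hypothetical. Under simple random sampling without replacement both reduce to $N^2(1 - n/N)s_b^2/n$.

```python
from itertools import combinations


def ht_variance(b, pi, pi_joint):
    """Horvitz-Thompson variance estimator over sampled units.

    b: values b_k of the sampled units; pi: their inclusion probabilities;
    pi_joint(k, t): joint inclusion probability of sampled positions k != t.
    """
    total = 0.0
    for k in range(len(b)):
        for t in range(len(b)):
            p_kt = pi[k] if k == t else pi_joint(k, t)
            total += ((p_kt - pi[k] * pi[t]) / (p_kt * pi[k] * pi[t])
                      * b[k] * b[t])
    return total


def syg_variance(b, pi, pi_joint):
    """Sen-Yates-Grundy variance estimator (fixed sample size designs)."""
    total = 0.0
    for k, t in combinations(range(len(b)), 2):
        p_kt = pi_joint(k, t)
        total += ((pi[k] * pi[t] - p_kt) / p_kt
                  * (b[k] / pi[k] - b[t] / pi[t]) ** 2)
    return total


# Simple random sampling without replacement: n = 4 units out of N = 10.
N, n = 10, 4
b = [3.0, 5.0, 2.0, 8.0]
pi = [n / N] * n
joint = lambda k, t: n * (n - 1) / (N * (N - 1))
v_ht = ht_variance(b, pi, joint)
v_syg = syg_variance(b, pi, joint)
```

For these values the sample variance of `b` is 7, so both estimators equal $100 \times 0.6 \times 7/4 = 105$.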

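Turning to the component $II$ of the total variance (2.3), we have
$V_m E_p(\hat{U}) = V_m(\sum \mathbf{u}_k^T \mathbf{h}_k) = \sum\sum \mathbf{u}_k^T \text{Cov}_m(\mathbf{h}_k, \mathbf{h}_t)\, \mathbf{u}_t$
and a $pm$-unbiased estimator is therefore given by

$$\text{est}(II) = \sum\sum d_{kt}(s)\, \mathbf{u}_k^T \text{cov}_m(\mathbf{h}_k, \mathbf{h}_t)\, \mathbf{u}_t, \qquad (2.6)$$

after replacing $\text{Cov}_m(\mathbf{h}_k, \mathbf{h}_t)$ by an
estimator $\text{cov}_m(\mathbf{h}_k, \mathbf{h}_t)$. The estimator of
total variance (2.3) is now given by $\text{est}(I) + \text{est}(II)$.
We denote it, in operator notation, as $\vartheta(\mathbf{u})$.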

We now turn to the estimation of total variance of $\hat\theta$.
Following Demnati and Rao (2004), a Taylor expansion of
$\hat\theta - \theta$ may be written as

$$\hat\theta - \theta \approx \sum \tilde{\mathbf{z}}_k^T (\mathbf{d}_k - \boldsymbol\mu_k), \qquad (2.7)$$

where
$\tilde{\mathbf{z}}_k = \partial f(A_b)/\partial \mathbf{b}_k |_{A_b = A_\mu}$
and $A_b$ is a $(p+1) \times N$ matrix with $k$th column
$\mathbf{b}_k$, a vector of arbitrary real numbers. The approximation
(2.7) is valid for any $\hat\theta$ that can be expressed as a smooth
function of estimated totals. Following Demnati and Rao (2004), a
linearization estimator of the total variance is now given by

$$\vartheta_{DR}(\hat\theta) = \vartheta(\mathbf{z}), \qquad (2.8)$$

which is obtained from $\vartheta(\mathbf{u})$ by replacing
$\mathbf{u}_k$ by the “linearized variable”
$\mathbf{z}_k = \partial f(A_b)/\partial \mathbf{b}_k |_{A_b = A_d}$.
A rigorous theoretical justification of (2.8) follows along the lines
of Deville (1999).


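2.3 Special case of ratio estimator


For the ratio estimator $\hat\theta = \bar{X}\hat{R}$ of the model
parameter $\theta = \beta\bar{X}$, $\mathbf{z}_k$ reduces to

$$\mathbf{z}_k = (\bar{X}/\hat{X})\,(-\hat{R} x_k, 1)^T = (z_{1k}, z_{2k})^T. \qquad (2.9)$$

Further, $b_k$ in (2.4) or (2.5) is replaced by

$$\mathbf{z}_k^T \mathbf{h}_k = z_{1k} + z_{2k} y_k = (\bar{X}/\hat{X})\, e_k, \qquad e_k = y_k - \hat{R} x_k,$$

using (2.9). Also, replacing $\mathbf{u}_k$ by $\mathbf{z}_k$ in
(2.6) we get

$$\mathbf{z}_k^T \text{cov}_m(\mathbf{h}_k, \mathbf{h}_t)\, \mathbf{z}_t = z_{2k} z_{2t}\, \text{cov}_m(y_k, y_t).$$

Under the ratio model (2.1) with unspecified model variance
$V_m(y_k) = \sigma_k^2$, $k = 1, \ldots, N$, we can estimate
$\sigma_k^2 = E_m(y_k - \beta x_k)^2$ by $(y_k - \hat{R} x_k)^2$ and
let $\text{cov}_m(y_k, y_t) = 0$ for $k \neq t$.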
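
The linearized variable for the ratio estimator can be checked against a numerical derivative. The sketch below is illustrative only: the helper names `f_ratio`, `linearized_z` and `numerical_z` and the toy data are hypothetical. It differentiates the ratio-type function with respect to each entry of the weighted columns and evaluates the result at the observed weights.

```python
def f_ratio(b1, b2, x, xbar_pop):
    """f(A_b) = Xbar * sum(b_2k) / sum(b_1k * x_k)."""
    return xbar_pop * sum(b2) / sum(u * v for u, v in zip(b1, x))


def linearized_z(d, x, y, xbar_pop):
    """Analytic linearized variables z_k = (Xbar / X_hat) * (-R_hat x_k, 1)."""
    x_hat = sum(dk * xk for dk, xk in zip(d, x))
    y_hat = sum(dk * yk for dk, yk in zip(d, y))
    r_hat = y_hat / x_hat
    g = xbar_pop / x_hat
    return [(-g * r_hat * xk, g) for xk in x]


def numerical_z(d, x, y, xbar_pop, k, eps=1e-6):
    """Central-difference derivative of f with respect to b_1k and b_2k."""
    b1 = list(d)
    b2 = [dk * yk for dk, yk in zip(d, y)]

    def shifted(row, delta):
        c1, c2 = list(b1), list(b2)
        (c1 if row == 0 else c2)[k] += delta
        return f_ratio(c1, c2, x, xbar_pop)

    return tuple((shifted(row, eps) - shifted(row, -eps)) / (2 * eps)
                 for row in (0, 1))


x = [10.0, 20.0, 30.0, 40.0]
y = [21.0, 39.0, 61.0, 80.0]
d = [2.0, 0.0, 2.0, 0.0]
z_analytic = linearized_z(d, x, y, xbar_pop=25.0)
z_numeric = [numerical_z(d, x, y, 25.0, k) for k in range(4)]
```

The agreement of the two versions illustrates why no separate estimation step is needed: the derivative is taken with respect to quantities that are fully known once the sample is observed.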

We now study the special case of simple random sampling without
replacement. In this case, both (2.4) and (2.5) reduce to

$$\text{est}(I) = \left(\frac{\bar{X}}{\bar{x}}\right)^2 \frac{1}{n}\left(1 - \frac{n}{N}\right) s_e^2, \qquad (2.10)$$

where $s_e^2 = \sum_{k \in s} e_k^2/(n-1)$, and (2.6) reduces to

$$\text{est}(II) = \left(\frac{\bar{X}}{\bar{x}}\right)^2 \frac{(n-1)}{n}\,\frac{1}{N}\, s_e^2. \qquad (2.11)$$

Hence, using (2.10) and (2.11), the variance estimator (2.8) reduces
to

$$\vartheta_{DR}(\hat\theta) = \text{est}(I) + \text{est}(II) = \left(\frac{\bar{X}}{\bar{x}}\right)^2 \frac{N-1}{N}\,\frac{s_e^2}{n}. \qquad (2.12)$$

It is interesting to note that the “g-weight” $\bar{X}/\bar{x}$
appears automatically in $\vartheta_{DR}(\hat\theta)$, given by
(2.12), and that the finite population correction $1 - n/N$ is absent
in $\vartheta_{DR}(\hat\theta)$ unlike in $\text{est}(I)$ given by
(2.10).

In the customary approach to the estimation of total variance (see
e.g., Korn and Graubard 1998) $V(\hat\theta)$ is first written as

$$V(\hat\theta) = E_m V_p(\hat\theta) + V_m E_p(\hat\theta) \approx E_m V_p(\hat\theta) + V_m(\bar{Y}) = E_m V_p(\hat\theta) + N^{-2} \sum E_m(y_k - \beta x_k)^2, \qquad (2.13)$$

under the ratio model with unspecified $\sigma_k^2$,
$k = 1, \ldots, N$. The first term $E_m V_p(\hat\theta)$ in (2.13) is
then estimated by a design-consistent estimator of $V_p(\hat\theta)$,
typically by (2.10) without the g-factor $(\bar{X}/\bar{x})^2$. The
second term is estimated by
$N^{-2} \sum d_k(s)(y_k - \hat{R} x_k)^2 = (nN)^{-1}(n-1) s_e^2$. The
sum of the two estimated terms then equals (2.12) without the
g-factor. We denote this customary variance estimator by
$\vartheta_{cus}(\hat\theta)$. On the other hand, if (2.10) with the
g-factor is used to estimate $V_p(\hat\theta)$, the sum of this
estimated term and the previous estimator of the second term leads to
a “hybrid” variance estimator

$$\vartheta_{mix}(\hat\theta) = \text{est}(I) + (nN)^{-1}(n-1) s_e^2,$$

where the g-term is absent in the last term. It is clear from the
above results that the choice of estimator of total variance under
the customary approach is not unique, unlike under the proposed
approach.
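
The three total-variance estimators for the ratio estimator under simple random sampling without replacement differ only in where the g-factor enters. A minimal sketch, with a hypothetical function name and toy data, is:

```python
def total_variance_estimators_srs(x, y, N, xbar_pop):
    """DR, customary and hybrid variance estimators for the ratio estimator.

    x, y: values of the sampled units (simple random sampling without
          replacement); N: population size; xbar_pop: known mean of x.
    """
    n = len(x)
    xbar = sum(x) / n
    r_hat = (sum(y) / n) / xbar
    e = [yk - r_hat * xk for xk, yk in zip(x, y)]
    s2e = sum(ek * ek for ek in e) / (n - 1)
    g2 = (xbar_pop / xbar) ** 2                    # squared g-weight
    design = (1.0 / n) * (1.0 - n / N) * s2e       # sampling part, no g-factor
    model = (n - 1) / (n * N) * s2e                # model part, no g-factor
    return {
        "est_I": g2 * design,
        "est_II": g2 * model,
        "DR": g2 * (design + model),
        "cus": design + model,
        "mix": g2 * design + model,
    }


est = total_variance_estimators_srs(
    x=[10.0, 20.0, 30.0, 40.0], y=[22.0, 38.0, 63.0, 79.0],
    N=20, xbar_pop=30.0)
```

In this toy sample the sample mean of x (25) is below the population mean (30), so the g-factor exceeds one and the three estimators are ordered as customary, hybrid, DR. When the sample mean of x equals the population mean, all three coincide.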


If the parameter of interest is $\beta = \theta/\bar{X}$ instead of
$\theta$, then $\hat\beta = \hat\theta/\bar{X} = \hat{R}$ and
$\vartheta_{DR}(\hat\beta)$ under simple random sampling is given by

$$\vartheta_{DR}(\hat\beta) = \frac{1}{\bar{x}^2}\,\frac{N-1}{N}\,\frac{s_e^2}{n}. \qquad (2.14)$$

The customary approach leads to the same variance estimator, (2.14).


2.4 Simulation study 


We conducted a small simulation study to examine the performances of
different variance estimators, both unconditionally and conditionally
on $\bar{x}$. We first generated $R = 2{,}000$ finite populations
$\{y_1, \ldots, y_N\}$ each of size $N = 393$, from the ratio model

$$y_k = 2 x_k + x_k^{1/2} \varepsilon_k, \qquad (2.15)$$

with independent values $\varepsilon_k$ generated from $N(0, 1)$,
where the fixed $x_k$ are the “number of beds” for the Hospitals
population studied in Valliant, Dorfman and Royall (2000, pages
424-427). One simple random sample of specified size $n$ is drawn
from each generated population. Our parameter of interest is
$\theta = \beta\bar{X}$, where $\beta = 2$.
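
The design of this simulation can be sketched as follows. Several parts are assumptions made for illustration only: the Hospitals "number of beds" values are not reproduced here, so a synthetic covariate is used; the model errors are taken to have standard deviation proportional to the square root of the covariate, as suggested by the ratio model; and the function name `simulate_ratio_mse` is hypothetical.

```python
import random


def simulate_ratio_mse(x_pop, n, beta=2.0, reps=2000, seed=1):
    """Simulated total MSE of the ratio estimator of theta = beta * Xbar.

    Each replicate generates a new finite population from the model
    y_k = beta * x_k + sqrt(x_k) * eps_k (assumed error structure) and then
    draws one simple random sample of size n without replacement.
    """
    rng = random.Random(seed)
    N = len(x_pop)
    xbar_pop = sum(x_pop) / N
    theta = beta * xbar_pop
    sq_err = 0.0
    for _ in range(reps):
        y_pop = [beta * xk + (xk ** 0.5) * rng.gauss(0.0, 1.0) for xk in x_pop]
        s = rng.sample(range(N), n)
        xbar = sum(x_pop[k] for k in s) / n
        ybar = sum(y_pop[k] for k in s) / n
        theta_hat = xbar_pop * ybar / xbar
        sq_err += (theta_hat - theta) ** 2
    return sq_err / reps


# Synthetic covariate standing in for the number of beds (assumption).
x_pop = [float(v) for v in range(10, 60)]          # N = 50
mse_small_n = simulate_ratio_mse(x_pop, n=10)
mse_census = simulate_ratio_mse(x_pop, n=len(x_pop))
```

When the whole population is sampled the design contributes nothing, yet the simulated MSE stays positive: it then equals the model variance of the finite population mean, which under the assumed error structure is the sum of the covariate values divided by the squared population size (0.69 for this synthetic covariate).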


Figure 1 Averages of variance estimates for selected sample sizes compared to estimated MSE of the ratio estimator ($\vartheta_{DR}$ = DR var. est., $\vartheta_I$ = sampling component; horizontal axis: sample size; series: simulated MSE, sampling component, DR var. est.): ratio model

Simulated total MSE of the ratio estimator
$\hat\theta = \bar{X}(\bar{y}/\bar{x})$ is calculated as
$M(\hat\theta) = R^{-1} \sum_{r=1}^{R} (\hat\theta_r - \theta)^2$,
where $\hat\theta_r$ is the value of $\hat\theta$ for the $r$th
simulated sample and $(\bar{y}, \bar{x})$ are the sample means. We
calculated the total variance estimate $\vartheta_{DR}(\hat\theta)$,
and its components $\vartheta_I = \text{est}(I)$ and
$\vartheta_{II} = \text{est}(II)$ from each simulated sample $r$ and
their averages $\bar\vartheta_{DR}$, $\bar\vartheta_I$, and
$\bar\vartheta_{II}$ over $r$. Figure 1 gives a plot of the averages
of variance estimates, $\bar\vartheta_{DR}$ and $\bar\vartheta_I$, and
the simulated total MSE for $n = 20, 40, \ldots, 380, 393$. In the
case of $n = N$, $\vartheta_I = 0$. It is seen from Figure 1 that
$\bar\vartheta_{DR}$ is approximately unbiased, whereas
$\bar\vartheta_I$ leads to severe underestimation as the sample size,
$n$, increases.

We also examined the conditional performance of the variance
estimators under simple random sampling given $\bar{x}$, by
conducting another simulation study for inference on $\theta$, using
model (2.15). The study is similar to the study of Royall and
Cumberland (1981) for inference on the finite population mean
$\theta_N = \bar{Y}$ from a fixed population $\{y_1, \ldots, y_N\}$.
We generated $R = 20{,}000$ finite populations
$\{y_1, \ldots, y_N\}$, each of size $N = 393$ from (2.15) using the
number of beds as $x_k$, and from each population we then selected
one simple random sample of size $n = 100$. We arranged the 20,000
samples in ascending order of $\bar{x}$-values and then grouped them
into 20 groups each of size 1,000 such that the first group, $G_1$,
contained 1,000 samples with the smallest $\bar{x}$-values, the next
group, $G_2$, contained the next 1,000 smallest $\bar{x}$-values, and
so on to get $G_1, \ldots, G_{20}$. For each of the 20 groups so
formed, we calculated the average values of the ratio estimates
$\hat\theta = \bar{X}(\bar{y}/\bar{x})$ and the mean estimates
$\bar{y}$, and the resulting conditional relative bias (CRB) in
estimating $\theta = 2\bar{X}$; see Figure 2. It is clear from Figure
2 that $\bar{y}$ is conditionally biased unlike $\hat\theta$:
negative CRB (-14%) for $G_1$ increasing to positive CRB (+14%) for
$G_{20}$. Note that both $\bar{y}$ and $\hat\theta$ are
unconditionally unbiased for $\theta$. The conditional bias of
$\hat\theta$ and $\bar{y}$ in estimating the model parameter $\theta$
is similar to the conditional bias in estimating the “census”
parameter $\theta_N = \bar{Y}$, as observed by Royall and Cumberland
(1981).
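
The grouping step of this conditional analysis can be sketched as follows. The function name `conditional_relative_bias` and the toy inputs are hypothetical; the sketch only illustrates sorting replicates by the sample mean of x, splitting them into equal-size groups and computing the percent relative bias within each group.

```python
def conditional_relative_bias(xbars, estimates, theta, n_groups):
    """Percent relative bias of an estimator within groups of replicates.

    xbars: sample mean of x for each replicate (the conditioning variable).
    estimates: the estimate obtained from each replicate.
    theta: target parameter value.
    n_groups: number of equal-size groups formed after sorting by xbars.
    """
    order = sorted(range(len(xbars)), key=lambda i: xbars[i])
    size = len(order) // n_groups
    crb = []
    for g in range(n_groups):
        idx = order[g * size:(g + 1) * size]
        avg = sum(estimates[i] for i in idx) / len(idx)
        crb.append(100.0 * (avg - theta) / theta)
    return crb


# Toy illustration with four replicates and two groups.
crb = conditional_relative_bias(
    xbars=[3.0, 1.0, 2.0, 4.0], estimates=[11.0, 8.0, 9.0, 12.0],
    theta=10.0, n_groups=2)
```

An estimator whose group-wise relative bias moves steadily from negative to positive across the groups is conditionally biased given the sample mean of x, even if its overall average is on target.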

We also calculated the conditional MSE of $\hat\theta$ and the
associated CRB of the variance estimators $\vartheta_{DR}$,
$\vartheta_{cus}$, and $\vartheta_{mix}$ based on the average values
of $\vartheta_{DR}$, $\vartheta_{cus}$, and $\vartheta_{mix}$ in each
group; see Figure 3. It is evident from Figure 3 that CRB of
$\vartheta_{cus}$ ranges from -28% to 20% across the groups whereas
$\vartheta_{DR}$ exhibits no such trend and its CRB is less than 5%
in absolute value except for $G_1$ and $G_{20}$. Also, the CRB of
$\vartheta_{mix}$ is largely negative and below that of
$\vartheta_{DR}$ for the first half of the groups and above for the
second half, but $\vartheta_{mix}$ exhibits no visible trends unlike
$\vartheta_{cus}$.

Figure 4 reports the conditional coverage rates (CCR) of normal
theory confidence intervals based on $\vartheta_{DR}$,
$\vartheta_{cus}$, $\vartheta_{mix}$ and $\vartheta_I$ (ignoring the
component $\vartheta_{II}$) for nominal level of 95%. As expected,
the use of $\vartheta_I$ leads to severe undercoverage because the
sampling fraction, 100/393, is significant. On the other hand, CCR
associated with $\vartheta_{DR}$ is closer to nominal level across
groups, while $\vartheta_{cus}$ exhibits a trend across groups with
CCR ranging from 91% to 97%. Further, CCR associated with
$\vartheta_{mix}$ is slightly below that of $\vartheta_{DR}$ for the
first half of the groups but $\vartheta_{mix}$ and $\vartheta_{DR}$
perform similarly.


Figure 2 Conditional relative bias of the expansion and ratio estimators: ratio model (horizontal axis: groups; vertical axis: relative bias; series: mean, ratio estimate)

Figure 3 Conditional relative bias of variance estimators $\vartheta_{DR}$, $\vartheta_{cus}$ and $\vartheta_{mix}$

Figure 4 Conditional coverage rates of normal theory confidence intervals based on $\vartheta_{DR}$, $\vartheta_{cus}$, $\vartheta_{mix}$ and $\vartheta_I$, for nominal level of 95%: ratio model


3. Calibration weighted estimating equations


3.1 Estimators of model parameters


Suppose that the super-population model on the responses $y_k$ is
specified by a generalized linear model (McCullagh and Nelder 1989)
with mean
$E_m(y_k) = \mu_k(\boldsymbol\theta) = h(\mathbf{x}_k^T \boldsymbol\theta)$,
where $\mathbf{x}_k$ is a $p \times 1$ vector of explanatory
variables, $\boldsymbol\theta$ is the $p$-vector of model parameters
and $h(\cdot)$ is a “link” function. For example, $h(a) = a$ gives a
linear regression model and $h(a) = e^a/(1 + e^a)$ gives a logistic
regression model for binary responses $y_k$.

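We define census estimating equations (CEE), based on estimating
functions $\mathbf{l}_k(\boldsymbol\theta)$, as
$\mathbf{l}(\boldsymbol\theta) = \sum \mathbf{l}_k(\boldsymbol\theta) = \mathbf{0}$
with $E_m \mathbf{l}_k(\boldsymbol\theta) = \mathbf{0}$, and the
solution to CEE gives the census parameter vector
$\boldsymbol\theta_N$. For example,
$\mathbf{l}_k(\boldsymbol\theta) = \mathbf{x}_k (y_k - \mu_k(\boldsymbol\theta))$
for linear and logistic regression models. We use generalized
regression (GREG) weights $w_k(s) = d_k(s)\, g_k(d(s))$, where the
“g-weights” are given by

$$g_k(d(s)) = 1 + (\mathbf{T} - \hat{\mathbf{T}})^T \left[\sum_j d_j(s)\, c_j \mathbf{t}_j \mathbf{t}_j^T\right]^{-1} c_k \mathbf{t}_k,$$

for specified $c_k$, where
$\hat{\mathbf{T}} = \sum d_k(s) \mathbf{t}_k$ is the HT estimator of
the known total $\mathbf{T}$ of a $q \times 1$ vector of calibration
variables $\mathbf{t}_k$ and $d(s)$ is the $N \times 1$ vector of the
weights $d_k(s)$. The GREG weights, $w_k(s)$, have the calibration
property $\sum w_k(s) \mathbf{t}_k = \mathbf{T}$ and lead to efficient
estimators $\hat{Y} = \sum w_k(s) y_k$ of totals $Y = \sum y_k$, when
$y_k$ and $\mathbf{t}_k$ are linearly related (Särndal, Swensson and
Wretman 1989, chapter 6).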
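
The calibration property of the GREG weights is easy to verify numerically in the simplest case of a single calibration variable (q = 1), where the matrix inverse reduces to a scalar division. The sketch below is illustrative; the function name `greg_g_weights` and the toy data are hypothetical.

```python
def greg_g_weights(d, t, c, T):
    """GREG g-weights for one calibration variable (q = 1).

    d: design weights of the sampled units; t: calibration variable values;
    c: specified constants c_k; T: known population total of t.
    """
    t_hat = sum(dk * tk for dk, tk in zip(d, t))            # HT estimator of T
    denom = sum(dk * ck * tk * tk for dk, ck, tk in zip(d, c, t))
    return [1.0 + (T - t_hat) * ck * tk / denom for ck, tk in zip(c, t)]


d = [5.0, 5.0, 5.0]
t = [1.0, 2.0, 3.0]
g = greg_g_weights(d, t, c=[1.0, 1.0, 1.0], T=36.0)
w = [dk * gk for dk, gk in zip(d, g)]                       # calibrated weights
calibrated_total = sum(wk * tk for wk, tk in zip(w, t))
```

Here the design-weighted total of t is 30, below the known total 36, so every g-weight exceeds one and the calibrated weights reproduce the known total exactly.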

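We use the calibration weights, $w_k(s)$, to estimate the CEE. The
calibration weighted estimating equations are given by

$$\hat{\mathbf{l}}(\boldsymbol\theta) = \sum w_k(s)\, \mathbf{l}_k(\boldsymbol\theta) = \sum d_k(s)\, g_k(d(s))\, \mathbf{l}_k(\boldsymbol\theta) = \mathbf{0}. \qquad (3.1)$$

The solution to (3.1), obtained by the Newton-Raphson-type iterative
method, gives the calibration-weighted estimator
$\hat{\boldsymbol\theta}$ of $\boldsymbol\theta$, and
$\hat{\boldsymbol\theta}$ is approximately design-model unbiased for
$\boldsymbol\theta$, i.e.,
$E(\hat{\boldsymbol\theta}) \approx \boldsymbol\theta$. It follows
from (3.1) that $\hat{\boldsymbol\theta}$ is of the form $f(A_d)$
with
$\mathbf{d}_k = (d_k(s), d_k(s)\, \mathbf{l}_k^T(\boldsymbol\theta))^T$,
where $f(A_d)$ is a $p \times 1$ vector and $A_d$ is a
$(p+1) \times N$ matrix with $k$th column $\mathbf{d}_k$. Here we have
$h_{1k} = 1$ and
$(h_{2k}, \ldots, h_{(p+1)k})^T = \mathbf{l}_k(\boldsymbol\theta)$.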
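
A Newton-Raphson iteration for weighted estimating equations can be sketched for a logistic model with an intercept and one covariate, so that the 2 x 2 linear system can be solved in closed form. The function name `solve_weighted_logistic`, the fixed number of iterations and the toy data are assumptions made for illustration; the weights passed in play the role of the calibration weights.

```python
import math


def solve_weighted_logistic(x, y, w, iters=25):
    """Solve sum_k w_k * x_k * (y_k - mu_k(theta)) = 0 with x_k = (1, x_k).

    Returns (intercept, slope). Each step adds J^{-1} times the weighted
    score, where J = sum_k w_k * x_k x_k^T * mu_k * (1 - mu_k).
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        s0 = s1 = j00 = j01 = j11 = 0.0
        for xk, yk, wk in zip(x, y, w):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xk)))
            r = wk * (yk - mu)
            s0 += r
            s1 += r * xk
            v = wk * mu * (1.0 - mu)
            j00 += v
            j01 += v * xk
            j11 += v * xk * xk
        det = j00 * j11 - j01 * j01
        a += (j11 * s0 - j01 * s1) / det
        b += (j00 * s1 - j01 * s0) / det
    return a, b


def weighted_score(x, y, w, a, b):
    """Weighted estimating functions evaluated at (a, b)."""
    s0 = s1 = 0.0
    for xk, yk, wk in zip(x, y, w):
        mu = 1.0 / (1.0 + math.exp(-(a + b * xk)))
        s0 += wk * (yk - mu)
        s1 += wk * xk * (yk - mu)
    return s0, s1


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
w = [2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
a_hat, b_hat = solve_weighted_logistic(x, y, w)
score = weighted_score(x, y, w, a_hat, b_hat)
```

At convergence both weighted estimating functions are numerically zero, which is the defining property of the solution.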


3.2 Linearized variance estimators 


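We first extend the result on variance estimation for the scalar case
$\hat{U} = \sum \mathbf{u}_k^T \mathbf{d}_k$ (Section 2.2) to the
vector case
$\hat{\mathbf{U}} = \sum \mathbf{U}_k \mathbf{d}_k = \sum \mathbf{b}_k d_k(s)$,
where $\mathbf{b}_k = \mathbf{U}_k \mathbf{h}_k$ is a $p$-vector and
$\mathbf{U}_k$ is a $p \times (p+1)$ matrix with rows
$\mathbf{u}_{1k}^T, \ldots, \mathbf{u}_{pk}^T$. In this case, the SYG
variance estimator (2.4) is changed to

$$\text{est}(I) = \vartheta_{SYG}(\hat{\mathbf{U}}) = \sum\sum_{k<t} d_{kt}(s)\,(\pi_k\pi_t - \pi_{kt}) \left(\frac{\mathbf{b}_k}{\pi_k} - \frac{\mathbf{b}_t}{\pi_t}\right) \left(\frac{\mathbf{b}_k}{\pi_k} - \frac{\mathbf{b}_t}{\pi_t}\right)^T. \qquad (3.2)$$

Similarly, the HT variance estimator (2.5) is changed to

$$\text{est}(I) = \vartheta_{HT}(\hat{\mathbf{U}}) = \sum\sum d_{kt}(s)\,\frac{\pi_{kt} - \pi_k\pi_t}{\pi_k\pi_t}\, \mathbf{b}_k \mathbf{b}_t^T. \qquad (3.3)$$

Turning to the component $II$ of the total variance of
$\hat{\mathbf{U}}$, (2.6) is changed to

$$\text{est}(II) = \sum\sum d_{kt}(s)\, \mathbf{U}_k\, \text{cov}_m(\mathbf{h}_k, \mathbf{h}_t)\, \mathbf{U}_t^T. \qquad (3.4)$$

The total variance of $\hat{\mathbf{U}}$ is estimated by the sum of
(3.2) and (3.4) for fixed sample size designs or by the sum of (3.3)
and (3.4) for arbitrary designs.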

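A linearization variance estimator of the total variance of
$\hat\theta$ is obtained from the estimator of the total variance of
$\hat{U}$ by replacing $U_k$ by the linearized variable $Z_k =
\partial f(A_b)/\partial b_k' \,|_{A_b = A_d}$. Following the implicit differentiation
method of Demnati and Rao (2004), $Z_k$ reduces to

$$Z_k = [\hat{J}(\hat\theta)]^{-1}\, g_k(d(s))\, (-\hat{B}' t_k,\; I_p),$$

with

$$\hat{B} = \Big[\sum_k d_k(s)\, c_k\, t_k t_k'\Big]^{-1} \sum_k d_k(s)\, c_k\, t_k\, l_k'(\hat\theta),$$

$$\hat{J}(\theta) = -\sum_k d_k(s)\, g_k(d(s))\, (\partial l_k(\theta)/\partial \theta'),$$

and $I_p$ is the $p \times p$ identity matrix.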
After some simplification, the first component est($I$) is
given by (3.2) or (3.3) with $b_k$ changed to

$$Z_k h_k = [\hat{J}(\hat\theta)]^{-1}\, e_k(\hat\theta)\, g_k(d(s)), \qquad (3.5)$$

where
$e_k(\theta) = l_k(\theta) - \hat{B}' t_k$.
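Similarly, the second component est($II$) simplifies to

$$\mathrm{est}(II) = [\hat{J}(\hat\theta)]^{-1} \sum_k d_k(s)\, g_k^2(d(s))\, l_k(\hat\theta)\, l_k'(\hat\theta)\, \{[\hat{J}(\hat\theta)]^{-1}\}', \qquad (3.6)$$

if $\mathrm{Cov}_m[l_k(\theta), l_t(\theta)] = 0$ for $k \neq t$.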
The total variance of $\hat\theta$ is now estimated by

$$\vartheta_{DR}(\hat\theta) = \mathrm{est}(I) + \mathrm{est}(II). \qquad (3.7)$$

This variance estimator of $\hat\theta$ automatically takes account of
the g-weights as in Section 2.

A customary variance estimator of $\hat\theta$, $\vartheta_{cus}(\hat\theta)$, is
obtained from (3.7) by ignoring the g-weights in (3.5) and
(3.6). Similarly, a hybrid variance estimator, $\vartheta_{mix}(\hat\theta)$, is
obtained from (3.7) by retaining the g-weights in est($I$)
and ignoring them in est($II$).


3.3 Simulation study 


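We conducted a simulation study to compare the relative
performances of the three variance estimators $\vartheta_{DR}$, $\vartheta_{cus}$
and $\vartheta_{mix}$ for the special case of a logistic regression model:

$$E_m(y_k) = \mu_k(\theta) = \exp(\mathbf{x}_k'\theta) / \{1 + \exp(\mathbf{x}_k'\theta)\}, \qquad (3.8)$$

$$V_m(y_k) = \mu_k(\theta)(1 - \mu_k(\theta)), \quad \mathrm{Cov}_m(y_k, y_t) = 0, \; k \neq t.$$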


In this case, we have $l_k(\theta) = \mathbf{x}_k (y_k - \mu_k(\theta))$, and
$\hat{J}(\theta) = \sum_k d_k(s)\, g_k(d(s))\, \mathbf{x}_k \mathbf{x}_k'\, \mu_k(\theta)(1 - \mu_k(\theta))$.
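
To make the Newton-Raphson-type solution of the weighted estimating equations concrete for this logistic case, the following Python sketch iterates $\theta \leftarrow \theta + [\hat{J}(\theta)]^{-1} \hat{l}(\theta)$ with $\mathbf{x}_k = (1, x_k)'$. It is a minimal illustration under stated assumptions, not the authors' implementation: the weights `w` stand for the products $d_k(s) g_k(d(s))$, and the data, function names, and fixed iteration count are hypothetical.

```python
import math


def weighted_score(theta, x, y, w):
    """Weighted score sum_k w_k x_k (y_k - mu_k(theta)) with x_k = (1, x_k)'."""
    t0, t1 = theta
    s0 = s1 = 0.0
    for xk, yk, wk in zip(x, y, w):
        mu = 1.0 / (1.0 + math.exp(-(t0 + t1 * xk)))
        s0 += wk * (yk - mu)
        s1 += wk * (yk - mu) * xk
    return s0, s1


def solve_weighted_logistic(x, y, w, iters=25):
    """Newton-Raphson iterations for the weighted logistic estimating equations."""
    t0 = t1 = 0.0
    for _ in range(iters):
        s0, s1 = weighted_score((t0, t1), x, y, w)
        # J(theta) = sum_k w_k x_k x_k' mu_k (1 - mu_k), stored as a symmetric 2 x 2 matrix
        j00 = j01 = j11 = 0.0
        for xk, wk in zip(x, w):
            mu = 1.0 / (1.0 + math.exp(-(t0 + t1 * xk)))
            v = wk * mu * (1.0 - mu)
            j00 += v
            j01 += v * xk
            j11 += v * xk * xk
        det = j00 * j11 - j01 * j01
        # theta <- theta + J^{-1} * score, with the 2 x 2 inverse written out
        t0 += (j11 * s0 - j01 * s1) / det
        t1 += (j00 * s1 - j01 * s0) / det
    return t0, t1


# Hypothetical toy data: covariate, binary response, calibration weights.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_demo = [0, 0, 1, 0, 1, 1]
w_demo = [2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
theta_hat = solve_weighted_logistic(x_demo, y_demo, w_demo)
score_at_solution = weighted_score(theta_hat, x_demo, y_demo, w_demo)
```

At convergence the weighted score is numerically zero, which is the defining property of the solution to (3.1).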


For the simulation study, we set $\mathbf{x}_k = (1, x_k)'$, where
$x_k$ denotes the number of beds for hospital $k$ in the Hospitals
population of size $N = 393$ studied in Section 2.2. We
implemented post-stratification by dividing the population
into two classes, with $N_1 = 271$ hospitals $k$ having
$x_k < 350$ in class 1 and $N_2 = 122$ hospitals $k$ with
$x_k \geq 350$ in class 2. Here, $g_k(d(s)) = N_h / \hat{N}_h$, $h = 1, 2$, if
$k$ belongs to class $h$, where $\hat{N}_h = \sum_k d_k(s)\, t_{hk}$ is the
design-weight estimator of $N_h$, and $t_k = (t_{1k}, t_{2k})'$ is the
vector of class indicator variables $t_{hk}$.
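
The post-stratification g-weights just described can be sketched in a few lines of Python. The example below is illustrative only: the class counts, the design weights, and the function name are hypothetical, and the sketch assumes every class has at least one sampled unit.

```python
def poststrat_g_weights(classes, d, N_h):
    """Return g_k = N_h / N_hat_h for each sampled unit k.

    classes : class label h of each sampled unit
    d       : design weights d_k(s) of the sampled units
    N_h     : dict of known class counts
    """
    N_hat = {h: 0.0 for h in N_h}
    for h, d_k in zip(classes, d):
        N_hat[h] += d_k                 # N_hat_h = sum_k d_k(s) t_hk
    return [N_h[h] / N_hat[h] for h in classes]


# Hypothetical toy setting: N = 100, SRS of n = 10, so d_k = 10 for every unit.
classes = [1] * 7 + [2] * 3
d = [10.0] * 10
N_h = {1: 60, 2: 40}
g = poststrat_g_weights(classes, d, N_h)
w = [d_k * g_k for d_k, g_k in zip(d, g)]   # calibration weights w_k = d_k g_k
calibrated = {h: sum(w_k for w_k, c in zip(w, classes) if c == h) for h in N_h}
```

The calibrated weights reproduce the known class counts, which is the calibration property $\sum_k w_k(s) t_k = T$ in this special case.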

We generated $R = 40,000$ finite populations $\{y_1, \ldots, y_N\}$,
each of size $N = 393$, assuming the logistic regression
model (3.8) with $\theta = (\theta_0, \theta_1)' = (-1, 0.005)'$. The parameter
of interest is $\theta_1 = 0.005$. From each generated
population, we selected one simple random sample of size
$n = 150$, and then obtained the calibration-weighted estimate
$\hat\theta_1$ and the associated variance estimators est($I$) $= \vartheta_1(\hat\theta_1)$,
$\vartheta_{DR}(\hat\theta_1)$, $\vartheta_{cus}(\hat\theta_1)$ and $\vartheta_{mix}(\hat\theta_1)$ from each sample
$r$. We obtained the averages of the estimates and the
variance estimates as av($\hat\theta_1$) = 0.00514, av($\vartheta_{DR}$) = 0.0989,
av($\vartheta_{cus}$) = 0.0987, av($\vartheta_{mix}$) = 0.0988, and av($\vartheta_1$) $\approx$ 0.0613.
Also, the estimated total MSE of $\hat\theta_1$ is equal to 0.0998.
Hence, unconditionally the estimator $\hat\theta_1$ is approximately
unbiased for $\theta_1$, and the bias of the three variance
estimators $\vartheta_{DR}$, $\vartheta_{cus}$ and $\vartheta_{mix}$ is negligible. On the other
hand, ignoring the second component and using only the
first component, est($I$) $= \vartheta_1(\hat\theta_1)$, leads to severe
underestimation, as expected.

We also examined the conditional performances of the
three variance estimators along the lines of Section 2.2. We
arranged the 40,000 samples in ascending order of the
sample size, $n_1$, in class 1, and then grouped the samples
into twenty groups, each of size 2,000, such that the first
group, $G_1$, contained the 2,000 samples with the smallest
$n_1$-values, the second group, $G_2$, contained the 2,000
samples with the next smallest $n_1$-values, and so on to get
twenty groups, $G_1, \ldots, G_{20}$.

We calculated the conditional MSE of $\hat\theta_1$ and the
associated conditional relative bias (CRB) of the variance
estimators $\vartheta_{DR}$, $\vartheta_{cus}$ and $\vartheta_{mix}$, based on the average values
of $\vartheta_{DR}$, $\vartheta_{cus}$ and $\vartheta_{mix}$ in each group; see Figure 5. We can
see from Figure 5 that the CRB of $\vartheta_{cus}$ ranges from 20% to
-20% across the groups, whereas $\vartheta_{DR}$ exhibits no such trend
and its CRB is less than 5% in absolute value except for two
groups. Also, the CRB of $\vartheta_{mix}$ exhibits a trend, but one less
pronounced than that of $\vartheta_{cus}$. Figure 6 reports the conditional
coverage rates (CCR) of normal theory intervals based on
$\vartheta_{DR}$, $\vartheta_{cus}$ and $\vartheta_{mix}$ for a nominal level of 95%. We can see
from Figure 6 that $\vartheta_{cus}$ exhibits a trend across groups, with
CCR ranging from 97% to 92%, whereas the CCR associated
with $\vartheta_{DR}$ is close to the nominal level across groups.
Further, the CCR associated with $\vartheta_{mix}$ is slightly above that of
$\vartheta_{DR}$ for the first half of the groups and slightly below for the
remaining groups.
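
The grouping and the two conditional summaries can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes CRB is computed as the group average of the variance estimates divided by the group MSE, minus one, and that the normal theory interval is $\hat\theta_1 \pm 1.96\, \vartheta^{1/2}$; the record layout and function name are hypothetical.

```python
import math


def conditional_summary(records, theta_true, n_groups=20, z=1.96):
    """Group simulated samples by n1 and return (CRB, CCR) for each group.

    records : list of (n1, theta_hat, var_hat) tuples, one per simulated sample
    """
    ordered = sorted(records, key=lambda r: r[0])   # ascending order of n1
    size = len(ordered) // n_groups                 # any leftover samples are dropped
    out = []
    for g in range(n_groups):
        grp = ordered[g * size:(g + 1) * size]
        mse = sum((th - theta_true) ** 2 for _, th, _ in grp) / len(grp)
        avg_var = sum(v for _, _, v in grp) / len(grp)
        crb = avg_var / mse - 1.0                   # assumed definition of CRB
        covered = sum(1 for _, th, v in grp
                      if abs(th - theta_true) <= z * math.sqrt(v))
        out.append((crb, covered / len(grp)))
    return out


# Hypothetical toy input: four samples split into two groups of two.
toy = [(1, 1.0, 1.0), (2, -1.0, 1.0), (3, 2.0, 2.0), (4, -2.0, 1.0)]
summary = conditional_summary(toy, theta_true=0.0, n_groups=2)
```

In the simulation described above, `records` would hold the 40,000 triples and `n_groups` would be 20, with one call per variance estimator.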
Figure 5 Conditional relative bias of variance estimators: logistic regression
Figure 6 Conditional coverage rates of normal theory confidence intervals for nominal level of 95%: logistic regression


Concluding remarks 


We have studied the estimation of total variance of 
estimators of model parameters under an assumed super- 
population model. Our approach leads directly to a 
linearization variance estimator which is shown to perform 
well under a conditional framework when calibration 
weights are used for estimation. We are currently inves- 
tigating extensions of our method to estimation of total 
variance under imputation for item nonresponse and 
integration of two independent surveys. 
Statistical foundations of cell-phone surveys


Kirk M. Wolter, Phil Smith and Stephen J. Blumberg


Abstract 


The size of the cell-phone-only population in the USA has increased rapidly in recent years and, correspondingly, 
researchers have begun to experiment with sampling and interviewing of cell-phone subscribers. We discuss statistical 
issues involved in the sampling design and estimation phases of cell-phone studies. This work is presented primarily in the 
context of a nonoverlapping dual-frame survey in which one frame and sample are employed for the landline population and 
a second frame and sample are employed for the cell-phone-only population. Additional considerations necessary for 
overlapping dual-frame surveys (where the cell-phone frame and sample include some of the landline population) are also 
discussed. We illustrate the methods using the design of the National Immunization Survey (NIS), which monitors the 
vaccination rates of children age 19-35 months and teens age 13-17 years. The NIS is a nationwide telephone survey, 
followed by a provider record check, conducted by the Centers for Disease Control and Prevention. 


Key Words: Cell-phone study; Random digit dialing; Dual-frame survey; Network sampling; Indirect sampling;
Linking rules; Weighting of survey data; National Immunization Survey. 


1. Introduction 


The number of persons with cell phones in the USA has 
increased rapidly in recent years, and the percent of adults 
living in households with cell phones is expected to soon 
exceed the percent living in households with landlines 
(CTIA 2008; Blumberg and Luke 2008; Arthur 2007; Ehlen 
and Ehlen 2007). Correspondingly, survey researchers have 
begun to experiment with the sampling and interviewing of 
cell-phone subscribers (Lavrakas, Shuttles, Steeh and 
Fienberg 2007). This article is about the issues of statistical 
design and estimation that arise in cell-phone surveys. It 
emphasizes theoretically rigorous but practical solutions to 
the emergent problems survey researchers are facing in cell- 
phone surveys today. 

Standard telephone surveys driven by random-digit- 
dialing (RDD) sampling only cover the population of 
households that have at least one working landline 
telephone actually used for voice communications. In an 
RDD survey, one assumes that the landline telephone is a 
household appliance and that all persons in the population 
are attached to one and only one household. Thus, one can 
sample people indirectly by sampling their telephone 
numbers and proceed from there to use reasonably standard 
and well-known methods of estimation. 

The cell-phone survey brings a paradigm shift and new 
challenges. Most people think of the cell phone as a 
personal appliance, not a household device. Some people do 
share a cell phone, including 10-20 percent of cell-phone- 
only adults (Carley-Baxter, Peytchev and Lynberg 2008), 
but many do not, and thus it cannot be assumed that all 
residents of a household can be reached through the same
cell-phone line. Some residents of a household can be
reached through more than one cell-phone line. Some 
residents can be reached only by a cell-phone line while 
others can be reached through both cell and landline 
telephones. Thus, in the cell-phone survey, the household 
may no longer provide the same unifying organization that it 
does in standard telephone surveys. 

To address the growing risk of bias (due to under- 
coverage) in telephone surveys, one can consider dual-frame 
telephone survey designs that include both an RDD sample 
of landline telephones and a sample of cell-phone lines. The 
telephone numbers on the two sampling frames are non- 
overlapping, but the corresponding people and households 
that may be the objects of the survey are partially 
overlapping. 

A rigorous theory of estimation for such telephone 
survey designs has been lacking, although some initial 
descriptions of weighting have been advanced by Brick, 
Dipko, Presser, Tucker and Yuan (2006), Brick, Edwards 
and Lee (2007), and Frankel, Battaglia, Link and Mokdad 
(2007). In this article, we provide a general theory of 
unbiased estimation for population totals in the context of 
dual-frame telephone survey designs and derive the 
corresponding survey weights. We show what information 
must be collected in the survey itself to enable the 
calculation of the sampling weights. 

To introduce ideas, we let A signify the portion of the 
overall population of interest accessible through the landline 
sampling frame, let B denote the portion accessible through 
the cell-phone sampling frame, and let C denote the portion 
not accessible through either frame (the phoneless 
population and other relatively small components of the
total population). We let a be the subpopulation in A not
accessible through cell-phone lines (the landline-only
population), let b be the subpopulation in B not accessible 
through landlines (the cell-phone-only population), and let 
ab be the subpopulation accessible through both landlines 
and cell-phone lines (the mixed population). We will 
sharpen this notation in succeeding sections. 

Whether or not a unit in the population of interest is 
accessible through landlines or cell-phone lines is itself a 
complex matter. Throughout this article, when we say that a 
unit is accessible through landlines, we shall mean that there 
is both physical access to one or more landlines (usually 
residential landlines only) and a respondent would actually 
answer the landline if it rang for voice communications. 
Many adults today maintain a landline telephone strictly for 
computer communications and utilize a cell phone for all 
voice communications. By our definition, such adults are 
not considered to have landline access and instead are 
considered to be in the cell-phone-only population. Simi- 
larly, when we say that a unit is accessible through cell- 
phone lines, we shall mean that there is both physical access 
to a cell phone and intent to answer the cell phone if it rang. 
All other units in the population of interest that are not 
accessible through either landlines or cell-phone lines are 
considered phoneless. Current evidence suggests, although
no one knows for sure, that about 20 to 30 percent of adults
are in domain b, 5 to 10 percent are in domain C, and the
balance are spread across domains a and ab.

What we know so far from the cell-phone surveys we
and others have conducted is that the data collection is
relatively expensive, with average interviewer hours per
completed case running around three times the average for
standard RDD surveys. The higher cost is brought about, in
part, by the legal requirement (in the US, the Telephone
Consumer Protection Act) of manually dialing the selected
cell phones. Response rates are somewhat lower than those
achieved in RDD surveys. Interview length may be 
problematic, with some respondents less willing to submit 
to a lengthy interview by cell phone than by landline phone. 
Privacy issues may constrain the cell-phone interview, if the 
respondent is not in a private place at the time of the 
interview. The cell-phone user’s propensity to respond may 
vary monotonically with his or her level of use of the cell 
phone, with the heavy user more willing to answer the 
phone than the lighter or occasional user. Most breakoffs 
occur during the opening seconds of the interview attempt. 
Because cell-phone surveys are relatively new, people are 
not used to being called and the interviewer has mere 
seconds to sell the survey. On the other hand, we find many 
cell-phone respondents to be quite cooperative once their 
attention has been held through the survey’s introductory 
script. 
Due to all of these circumstances in the environment, we
currently view the cell-phone sample as a relatively small 
supplementary sample, with the main sample continuing to 
be a larger RDD sample of landlines. The cell-phone sample 
is intended to round out the coverage of the population of 
interest. In the future, as the environment matures and if 
costs come down, it may be possible to shift towards a more 
balanced approach with similarly sized landline and cell- 
phone samples, or even to a state where the cell-phone 
sample begins to dominate and the landline sample is used 
as a supplement to round out coverage. 

In Section 2, we introduce the topic of networks of 
sampling units, reporting units, and estimation units and 
show how cell-phone surveys equate to a sampling of 
networks. Section 3 introduces various key concepts that 
will be needed as we discuss survey estimation, among 
them being the idea of a link (or edge) between the nodes
(or vertices) in the network. Section 4 describes the duality 
that exists between the populations corresponding to the 
different types of nodes. Our approach will remind some 
readers of Lavallée’s (2007) methods for indirect sampling. 
The heart of the paper is Section 5, which sets forth 
unbiased estimators of population totals for cell-phone 
surveys and for corresponding dual-frame telephone survey 
designs. Section 6 gives an example, illustrating implica- 
tions of the new methods of estimation for an existing 
telephone survey regarding the vaccination coverage of 
young children and teenagers. We close in Section 7 with a 
brief summary. 

Throughout the article, we emphasize the development of 
rigorous but practical design and estimation procedures for 
population B. The methods of RDD surveys, i.e., the 
methods for population A, are well known and, to a degree, 
have been used for decades; for a recent review of these 
methods see Wolter, Chowdhury and Kelly (2008). 


2. Networks of units and the response protocol 


In general, at least three types of units arise in the context 
of a cell-phone survey, as follows: 
Sampling units (SU) 
Reporting units (RU) 
Estimation units (EU). 


The SU is the unit of sampling in the survey. In actual 
practice, telephone numbers may be sampled directly from 
cell-phone frames, or they may be sampled in stages, with 
perhaps exchanges or banks of numbers serving as the 
primary sampling units and numbers themselves being 
selected in one or more stages of subsampling within the 
primary units. To keep the discussion simple, in this article 
we will present the telephone number itself as the SU. 
The actual target of the survey interview and the unit of
analysis is what we shall call the EU. Some surveys focus 
on the collection and analysis of data on households or 
families, in which case the household or family is the EU. 
Other surveys focus on person level data, where the eligible 
persons may be children under age 18, adults age 18+, or 
some demographic segment of the population, such as 
Hispanic females aged 0-34. Still other surveys focus on 
both household- and person-level data, in which case the 
survey involves at least two types of EUs and two levels of 
analysis. 

The adult is the respondent or RU in telephone surveys. 
The EU may or may not have the capacity to respond 
directly for itself, and instead an RU responds on its behalf. 
If the EU is an adult, then the same adult or even a different 
adult may serve as the corresponding RU. If the EU is a 
household, family, consumer unit, or child, then one or 
more adults may serve as the corresponding RU. The 
response protocol, specified by the survey methodologist, 
actually determines which RUs are permitted to respond for 
which EUs. In a typical survey, one respondent adult (or 
RU) would be contacted by telephone and interviewed for 
each SU selected into the sample. 

SUs, RUs, and EUs may bear different relationships to 
one another in a cell-phone survey. Figure 1 gives nine
networks that illustrate some of the types of relationships 
that are possible. In the first network, one SU is linked to 
one RU, which in turn responds for one EU. This 
arrangement could occur if one adult uses one telephone 
line, and the adult in turn reports for the household or for 
him or herself or for one child. In the second network, one 
SU is linked to two RUs, each of which can respond for the 
EU. This arrangement would occur, for example, if two 
adults shared the same telephone line and each was 
permitted by survey protocol to respond for the household. 
The fifth network could occur if two adults each had their 
own telephone line not shared with the other adult, while 
each adult in the pair is allowed by survey protocol to 
respond for each of two children. 

More complicated networks are possible and surely must 
exist in the world. For example, the eighth network shows 
an arrangement of three adults sharing two telephone lines. 
The first of the lines is shared by all three adults, while the 
second line is only used by the third adult. The first of the 
adults is permitted by survey protocol to respond for two 
EUs, such as the adult’s biological children; the second 
adult is not permitted to respond for any EUs; and the third 
adult is permitted to respond only for a third EU that is not 
reportable by the first two adults. 
Figure 1 Examples of networks in a cell-phone survey


3. Links between units in the network 


A link is a salient relationship between two nodes in the 
network. In the context of Figure 1, the links are represented 
by the line segments that join the different nodes. To 
provide a foundation for survey estimation, we need to 
explore links between (i) RUs and SUs, (ii) EUs and RUs,
and (iii) EUs and SUs.


3.1 Link of RU and SU 


Two concepts are central to creating a link between an 
RU and an SU, namely, the concepts of (a) an Active 
Personal Cell Number (APCN) and (b) usual access to the 
cell-phone line. 

An APCN is a telephone line that is in service at the time 
of the cell-phone survey and can ring through to an eligible 
adult who uses the cell phone, at least partially, for personal 
matters. In other words, an APCN meets three tests: 


It is in service 
It connects to an eligible adult respondent 
It is not used exclusively for business purposes. 


We say that a given adult has usual access to a given 
APCN if and only if the individual has 


Regular, 
Substantial, and 
Ongoing use of the cell-phone line. 
Each APCN has one or more regular adult users, and
each individual user has usual access to one or more cell 
phones. In many cases, there is a unique one-to-one 
relationship between the cell-phone line and the adult user. 
In some cases, there is a one-to-many relationship between 
the cell-phone line and its users. 

We treat a given SU and a given RU as linked if and only 
if the SU is an APCN and the RU has usual access to the 
SU. A cell-phone survey must work with and recognize the 
links that exist between the population of SUs and the 
population of RUs. 


3.2 Link of EU and RU 


A given EU is linked to one or more RUs via natural 
relationships that exist in the world, such as those created by 
family or place of residence. For example, an adult 
respondent may respond to the survey interview on behalf 
of his or her household, family, or consumer unit. He or she 
may respond for him or herself, for a dependent child under 
age 18, or for his or her own parent or sibling. 

All surveys require a response protocol that defines 
which adult respondents are to respond for which EUs. The 
protocol is selected by the survey methodologist in light of 
feasibility, cost, and accuracy-of-reporting concerns. It is 
this protocol that establishes the links between EUs and 
RUs. 


3.3 Link of EU and SU


The foregoing links between RUs and SUs and between 
EUs and RUs determine the links between EUs and SUs. 
We say a given EU is linked to a given SU if and only if the 
EU is linked to at least one RU that in turn is linked to the 
SU. 

Some notation will become useful in our work in the
following sections. Let $j$ denote a given EU in the
population of interest and let $i$ be a given SU in the
population. Then define the indicator or link variables

$$\ell_{ji} = \begin{cases} 1, & \text{if the } j\text{th EU is linked to the } i\text{th SU} \\ 0, & \text{otherwise.} \end{cases}$$
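
The composition of links in Sections 3.1 to 3.3 can be illustrated with the eighth network described in Section 2 (three adults sharing two lines). In the Python sketch below, the identifiers `adult1`, `line1`, `child1` and the function name are hypothetical labels introduced for illustration only.

```python
def eu_su_links(ru_su, eu_ru):
    """Return the set of (EU, SU) pairs with link variable equal to 1.

    ru_su : dict mapping each RU to the set of SUs (APCNs) it has usual access to
    eu_ru : dict mapping each EU to the set of RUs permitted to respond for it
    """
    links = set()
    for eu, rus in eu_ru.items():
        for ru in rus:
            for su in ru_su.get(ru, ()):
                links.add((eu, su))     # EU linked to SU via at least one RU
    return links


# Eighth network: line1 is shared by all three adults, line2 is used by adult3 only.
ru_su = {"adult1": {"line1"}, "adult2": {"line1"}, "adult3": {"line1", "line2"}}
# adult1 responds for two EUs, adult2 for none, adult3 for a third EU.
eu_ru = {"child1": {"adult1"}, "child2": {"adult1"}, "child3": {"adult3"}}
links = eu_su_links(ru_su, eu_ru)
```

Here the third EU is linked to both lines, while the first two EUs are linked only to the shared line.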


4. Duality between the populations 
of SUs and EUs 


To begin the process of determining an unbiased estimation
procedure for cell-phone surveys, we establish that a
duality exists between the population of SUs or cell phones
(henceforth denoted by $U^{SB}$) and the population of EUs
that are linked to cell phones (denoted by $U^{EB}$). The goal of
a cell-phone survey is to make inferences concerning $U^{EB}$,
but we will soon see that this goal is equivalent to making
certain inferences concerning $U^{SB}$. (In this notation, the first
superscript designates the type of unit while the superscript
B refers to the cell-phone sampling frame. Later we will use
the superscript A to signify the landline sampling frame.)

In the EU domain, a population total of interest is given
by

$$Y^{EB} = \sum_{j \in U^{EB}} Y_j,$$

where the $Y$-variable on the right-hand side is a
questionnaire item or other recoded or derived variable
attached to the units in the population $U^{EB}$. Similarly, in the
SU domain, a population total is defined by

$$X^{SB} = \sum_{i \in U^{SB}} X_i,$$

where the $X$-variable on the right-hand side is any fixed
characteristic attached to the units in the population $U^{SB}$.

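While the interest of the survey analyst centers on the
total from the population of EUs (and on other parameters
of this population), one can obtain a corresponding
parameter in the SU domain by writing

$$Y^{EB} = \sum_{j \in U^{EB}} Y_j = \sum_{j \in U^{EB}} \sum_{i \in U^{SB}} \frac{\ell_{ji}}{\sum_{i' \in U^{SB}} \ell_{ji'}}\, Y_j = \sum_{i \in U^{SB}} X_i = X^{SB}, \qquad (1)$$

where the $X$-variable is now defined specifically by

$$X_i = \sum_{j \in U^{EB}} \frac{\ell_{ji}}{\sum_{i' \in U^{SB}} \ell_{ji'}}\, Y_j. \qquad (2)$$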

From (1), one can see the correspondence between
estimation in the SU domain and estimation in the EU
domain. The total $X^{SB}$, with $X_i$ defined as in (2), is
equivalent to the total of interest $Y^{EB}$, and thus the problem
of estimation of $Y^{EB}$ is equivalent to the problem of
estimation of $X^{SB}$.
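
The equivalence of the two totals is easy to check numerically. The Python sketch below builds the $X$-variable by giving each EU's $Y$-value in equal shares to the SUs it is linked to, and confirms that the SU total equals the EU total. The identifiers and toy values are hypothetical, and the sketch assumes every EU has at least one link.

```python
def su_values(y, links):
    """Compute X_i as the sum over linked EUs j of Y_j / (number of SUs linked to j).

    y     : dict mapping each EU j to Y_j
    links : set of (EU, SU) pairs whose link variable equals 1
    """
    n_links = {}
    for j, _ in links:
        n_links[j] = n_links.get(j, 0) + 1      # number of SUs linked to EU j
    x = {}
    for j, i in links:
        x[i] = x.get(i, 0.0) + y[j] / n_links[j]
    return x


# Hypothetical toy population: three EUs, two cell-phone lines.
y = {"e1": 10.0, "e2": 4.0, "e3": 6.0}
links = {("e1", "s1"), ("e2", "s1"), ("e3", "s1"), ("e3", "s2")}
x = su_values(y, links)
x_total = sum(x.values())
y_total = sum(y.values())
```

The third EU is linked to two lines, so each line receives half of its $Y$-value; both totals equal 20.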

We note that (2) arises in substantially the same form in 
the theory of indirect sampling. See Lavallée (2007), 
Theorem 4.1. In indirect sampling, SUs are linked to 
naturally defined clusters of EUs; if a given SU is selected 
into the sample, the survey data are collected for all EUs in 
the linked clusters. The analogy here is that the clusters are 
defined by the RUs that respond to the cell-phone interview 
attempt, and survey data are collected from the respondent 
for all EUs to which he or she is linked. The current 
situation is such that the cluster is defined by the SU-RU 
pair. An identifiability problem arises in this regard that 
does not occur in general in indirect sampling, and we 
elaborate on this matter in Section 5.5. 

In (2), we effectively allocate an equal share of $Y_j$ to
each SU $i$ to which it is linked. We could, alternatively,
achieve the same ends by allocating $Y_j$ to its linked SUs in
proportion to some other known measure of the intensity of
the relationship between $j$ and $i$. Although one could
conceive of an optimal allocation of $Y_j$ to its linked SUs, as
in Deville and Lavallée (2006), such an allocation may be
difficult to execute or may not be of great import in large
scale practical settings.


5. Estimation 


As mentioned in the introduction, some EUs will be
linked exclusively to cell phones, some will be linked
exclusively to landlines, and some will be linked to both
landlines and cell phones. Phoneless EUs, if any, will not be
linked to cell phones or to landlines. To provide notation for
this environment, let $U^{E}$ be the overall population of EUs
of interest, and let $U^{S}$ be the overall population of SUs. Let
$U^{EA}$ be the elements of $U^{E}$ that are linked to landlines, let
$U^{EB}$ be the elements that are linked to cell-phone lines, let
$U^{Ea}$ be the elements that are linked only to landlines, let
$U^{Eb}$ be the elements that are linked only to cell-phone lines,
let $U^{Eab}$ be the elements that are linked to both landlines
and cell-phone lines, and let $U^{EC}$ be the elements that are
phoneless. Note that $U^{E} = U^{EA} \cup U^{EB} \cup U^{EC}$,
$U^{EA} = U^{Ea} \cup U^{Eab}$, and $U^{EB} = U^{Eb} \cup U^{Eab}$, where $U^{Ea}$, $U^{Eb}$, $U^{Eab}$
and $U^{EC}$ are disjoint sets. Also, let $U^{SA}$ be the population
of landlines, such that $U^{S} = U^{SA} \cup U^{SB}$. Landlines and
cell-phone lines reflect disjoint subsets of the overall
population of SUs.

In the following Sections 5.1 and 5.2, we discuss
unbiased estimation for the subpopulation, say $U^{ET} =
U^{EA} \cup U^{EB}$, that is linked to at least one telephone of any
kind. We use the superscript T to designate this telephone
subpopulation. Subsequently, in Section 5.4, we briefly
discuss coverage of the phoneless population.

For EUs in $U^{ET}$, define the indicator variables

$$\delta_j = \begin{cases} 1, & \text{if none of the RUs linked to } j \text{ have access to landline service, while at least one of these RUs has usual access to cell-phone service} \\ 0, & \text{otherwise} \end{cases}$$

$$\varphi_j = \begin{cases} 1, & \text{if none of the RUs linked to } j \text{ have usual access to cell-phone service, while at least one of these RUs has access to landline service} \\ 0, & \text{otherwise.} \end{cases}$$

The $\delta$-variable is an indicator of cell-phone-only status
and the $\varphi$-variable is an indicator of landline-only status.

Then the population total of interest may be decomposed
as

$$Y^{ET} = Y^{Eb} + Y^{EA}, \qquad (3)$$

where

$$Y^{Eb} = \sum_{j \in U^{ET}} \delta_j Y_j$$

is the total of the cell-phone-only domain, and

$$Y^{EA} = \sum_{j \in U^{ET}} (1 - \delta_j) Y_j$$

is the total of the complement of this domain, including EUs
that are linked exclusively to landlines and mixed EUs that
are linked to both landlines and cell phones. The total of
EUs may also be written as

$$Y^{ET} = Y^{Ea} + Y^{Eab} + Y^{Eb}, \qquad (4)$$

where

$$Y^{Ea} = \sum_{j \in U^{ET}} \varphi_j Y_j$$

is the total of the landline-only population, and

$$Y^{Eab} = \sum_{j \in U^{ET}} (1 - \delta_j)(1 - \varphi_j) Y_j$$

is the total of the mixed population that has a combination
of landline and cell-phone access. Finally, the population
total may be written as

$$Y^{ET} = Y^{Ea} + Y^{EB}, \qquad (5)$$

where

$$Y^{EB} = \sum_{j \in U^{ET}} (1 - \varphi_j) Y_j$$

is the total of the complement (in the telephone population)
of the landline-only population.
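
The three decompositions can be verified on a toy telephone population. In the Python sketch below, each EU is summarized by its $Y$-value and two flags saying whether at least one linked RU has landline access and whether at least one has usual cell-phone access; this record layout, the function name, and the values are hypothetical simplifications.

```python
def domain_totals(eus):
    """Compute the domain totals used in decompositions (3), (4) and (5).

    eus : list of (Y_j, has_landline, has_cell) for EUs in the telephone population
    """
    tot = {"T": 0.0, "a": 0.0, "b": 0.0, "ab": 0.0, "A": 0.0, "B": 0.0}
    for y, landline, cell in eus:
        delta = 1 if (cell and not landline) else 0    # cell-phone-only indicator
        phi = 1 if (landline and not cell) else 0      # landline-only indicator
        tot["T"] += y
        tot["b"] += delta * y
        tot["a"] += phi * y
        tot["ab"] += (1 - delta) * (1 - phi) * y
        tot["A"] += (1 - delta) * y
        tot["B"] += (1 - phi) * y
    return tot


# Hypothetical toy population: landline-only, cell-only, and two mixed EUs.
eus = [(5.0, True, False), (3.0, False, True), (2.0, True, True), (7.0, True, True)]
tot = domain_totals(eus)
```

For these values the telephone total is 17, which equals 3 + 14 under (3), 5 + 9 + 3 under (4), and 5 + 12 under (5).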

We view (3) and, to some extent, (4) as the decompo- 
sitions of current practical interest and importance in 
telephone surveys in the USA and, in what follows, we 
present methods of estimation for each. Because of the 
current high relative cost of cell-phone interviews, surveys 
based on decomposition (5) would not be cost effective. It 
would almost always be better to represent the domain $U^{Eab}$
using a sample of landlines than using a sample of cell 
phones. If the relative cost of cell-phone interviewing shifts 
downward in the future, decomposition (5) could become 
economically viable. It may also be viable for surveys in 
other countries where the cost structure is more favorable to 
cell-phone interviews. 


5.1 Case of nonoverlapping domains 


In this section, we will use a sample of cell-phone lines 
for purposes of estimation for the cell-phone-only 
population $U^{Eb}$ and a sample of landlines for estimation for
the entire landline population $U^{EA}$. We observe that it is not
possible to directly select a sample of cell-phone-only lines,
because cell-phone-only status is not available on the 
sampling frame but rather is determined in the survey 
screening interview. To operationalize this design, one 
would screen-out cell-phone respondents who classify 
themselves in the mixed domain and terminate the inter- 
view, continuing the interview only for cell-phone-only 
respondents. 

Let $s^{SB}$ denote a probability sample of SUs (cell-phone
lines) selected from the population $U^{SB}$, and let $\{W_i^{SB}\}$
denote the set of base sampling weights such that

$$\hat{X}^{SB} = \sum_{i \in s^{SB}} W_i^{SB} X_i$$

is an unbiased estimator of the population total $X^{SB}$, where
$X_i$ is a characteristic of the $i$th unit in the population.
Assuming simple random sampling without replacement
within strata, the base weights are of the form

$$W_i^{SB} = N_h / n_h, \qquad (6)$$

where $h$ signifies the sampling stratum in which the $i$th SU
is selected, $N_h$ is the number of SUs on the sampling frame
in stratum $h$, and $n_h$ is the sample size in stratum $h$.
Typically, the cell-phone sampling frame would include all 
telephone numbers within the exchanges assigned by the 
telephone system to cell phones. Simple random sampling 
would be the most common method of sample selection 
from such exchanges. There is little information available 
on the cell-phone sampling frame to enable stratification of 
the sample, except for the coarse geographic information 
embodied within the area code. 
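
As a small illustration of the base weights in (6), the Python sketch below assigns $N_h / n_h$ to each sampled number, using the area code as the stratum. The area codes, frame counts, and function name are hypothetical.

```python
from collections import Counter


def base_weights(sample_strata, frame_counts):
    """Return W_i = N_h / n_h for each sampled SU, given its stratum label.

    sample_strata : stratum label h of each sampled SU
    frame_counts  : dict mapping stratum h to the frame count N_h
    """
    n_h = Counter(sample_strata)                    # realized sample size per stratum
    return [frame_counts[h] / n_h[h] for h in sample_strata]


# Hypothetical frame: two area codes used as strata.
sample_strata = ["212", "212", "312"]
frame_counts = {"212": 1000, "312": 500}
weights = base_weights(sample_strata, frame_counts)
```

The weights sum to the total frame size (1,500 here), as they must for an unbiased estimator of a frame count.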

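Let $s^{EB}$ be the corresponding sample of EUs, i.e.,
$s^{EB} = \{ j \in U^{EB} \mid j$ is linked to at least one SU $i$ in $s^{SB} \}$. We
will use this sample to estimate the domain total of EUs that
are linked only to a cell phone, $Y^{Eb}$. From (1) and (2), we
can readily see that the unbiased estimator of the domain
total is given by

$$\hat{Y}^{Eb} = \sum_{i \in s^{SB}} W_i^{SB} \sum_{j \in U^{EB}} \frac{\ell_{ji}}{\sum_{i' \in U^{SB}} \ell_{ji'}}\, \delta_j Y_j = \sum_{j \in s^{EB}} \delta_j Y_j W_j^{EB}, \qquad (7)$$

where the EU level sampling weights are defined by

$$W_j^{EB} = \sum_{i \in s^{SB}} W_i^{SB}\, \ell_{ji} \Big/ \sum_{i' \in U^{SB}} \ell_{ji'}, \quad \text{for } j \in s^{EB}. \qquad (8)$$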


Again, see Lavallée (2007) for expression of these 
weights in the context of indirect sampling. 
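
The unbiasedness of the estimator of the cell-phone-only total can be checked by brute force on a tiny frame. The Python sketch below enumerates every simple random sample of size 2 from a hypothetical frame of four lines, computes the EU-level weights as shares of the base weights of the sampled lines linked to each EU, and averages the resulting estimates. All identifiers and values are illustrative assumptions; the fourth line is linked to no EU, as would happen for a number that is not an APCN.

```python
from itertools import combinations


def estimate_cell_only_total(sample, base_w, links, y, delta):
    """Estimate the cell-phone-only total from one sample of SUs.

    sample : tuple of sampled SUs
    base_w : common base weight N / n of every sampled SU (SRS without replacement)
    links  : set of (EU, SU) pairs whose link variable equals 1
    """
    n_links = {}
    for j, _ in links:
        n_links[j] = n_links.get(j, 0) + 1          # number of SUs linked to EU j
    total = 0.0
    for j in y:
        # EU-level weight: base weights of sampled linked SUs, shared equally
        w_j = sum(base_w for i in sample if (j, i) in links) / n_links[j]
        total += delta[j] * y[j] * w_j              # unsampled EUs get weight 0
    return total


frame = ["s1", "s2", "s3", "s4"]
links = {("e1", "s1"), ("e2", "s1"), ("e2", "s2"), ("e3", "s3")}
y = {"e1": 10.0, "e2": 4.0, "e3": 6.0}
delta = {"e1": 1, "e2": 1, "e3": 0}                 # e3 is a mixed EU
n = 2
base_w = len(frame) / n
estimates = [estimate_cell_only_total(s, base_w, links, y, delta)
             for s in combinations(frame, n)]
mean_estimate = sum(estimates) / len(estimates)
true_total = sum(delta[j] * y[j] for j in y)
```

Averaged over the six equally likely samples, the estimator returns 14, the true cell-phone-only total of this toy population.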

Before leaving domain b, we observe in passing that it is 
possible to subsample the EUs and collect the survey 
information only for the subsample instead of enumerating 
all EUs linked to the sample RUs. If the statistician
chooses some form of subsampling, perhaps to control
sample size or cost, then an additional weighting factor 
would appear in the weights in (8). Such subsampling is 
referred to as two-stage indirect sampling in Lavallée (2007, 
Section 5.1). 

Turning to domain A, let $s^{SA}$ denote a standard RDD
sample of landline telephones, let $s^{EA}$ be the implied
sample of EUs, i.e., $s^{EA} = \{ j \in U^{EA} \mid j$ is linked to at least
one SU $i$ in $s^{SA} \}$, and let

$$\hat{Y}^{EA} = \sum_{j \in s^{EA}} W_j^{EA} Y_j \qquad (9)$$

be the standard unbiased estimator of the population total.
For brevity, we shall not derive the standard sampling
weights here; for more information about these weights, see
Wolter et al. (2008).

From (7) and (9), the unbiased estimator of the
population total of the EUs is given by

$$\hat{Y}^{ET} = \hat{Y}^{Eb} + \hat{Y}^{EA}, \qquad (10)$$

and the weights needed to support this estimator are $\{W_j^{EA}\}$
and $\{W_j^{EB}\}$.


5.2 Case of overlapping domains 


We now proceed with estimation starting from the 
decomposition (4). This means that in the cell-phone sample 
we will interview not only the cell-phone-only population, 
but also the mixed population (i.e., those that use both 
landline and cell telephones). The estimator of the popu- 
lation total of interest is now of the form 


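$$\hat{Y}^{T} = \hat{Y}^{a} + \hat{Y}^{ab} + \hat{Y}^{b}, \tag{11}$$

where

$$\hat{Y}^{a} = \sum_{j \in s^{EA}} W_j^{EA} \phi_j Y_j$$

is the estimator for the landline-only domain derived from
the landline sample, $\hat{Y}^{b}$ is defined in (7) and is the
estimator for the cell-phone-only domain derived from the
cell-phone sample, and $\hat{Y}^{ab}$ is an estimator of the mixed
domain obtained from both samples. The estimator of the
mixed domain is

$$\hat{Y}^{ab} = \lambda \sum_{j \in s^{EA}} W_j^{EA} (1-\phi_j) Y_j + (1-\lambda) \sum_{j \in s^{EB}} W_j^{EB} (1-\delta_j) Y_j. \tag{12}$$

The weights needed to support estimator (11) are $\{W_j^{EA}\}$
and $\{W_j^{EB}\}$.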

See Hartley (1962) for discussion of the mixing parameter
$\lambda$ in a dual-frame survey, focusing on considerations
of sampling variability. Turning to considerations of bias,
Brick et al. (2006) report that the propensity to respond to a 
cell-phone survey may be positively related to the frequency 
of use of the cell phone. Thus, the two pieces on the right 
side of (12) may be subject to a differential nonresponse 
bias not removed by the standard weighting-class methods. 
In the mixed population, infrequent users of the cell phone 
may be less likely to respond if surveyed in the cell-phone 
sample than if surveyed in the landline sample. If these
adults are substantially different from other adults in
the mixed population with respect to the key characteristics
under study in the survey, then (12) and also (11) could be
subject to a nonresponse bias.
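
To make the role of the mixing parameter concrete, here is a minimal Python sketch of the overlapping-domain estimator in (11) and (12). The record layout (weight, y-value, single-service flag) and the function name are assumptions made for illustration only.

```python
def overlap_estimate(landline, cell, lam):
    """Dual-frame total with a mixing parameter for the mixed domain.

    landline: list of (weight, y, landline_only) for landline-sample EUs
    cell:     list of (weight, y, cell_only) for cell-sample EUs
    lam:      share of the mixed-domain estimate taken from the
              landline sample; the cell sample supplies 1 - lam
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Single-service domains come from their own sample only.
    y_a = sum(w * y for w, y, only in landline if only)
    y_b = sum(w * y for w, y, only in cell if only)
    # The mixed domain is estimated twice and the two are blended.
    mixed_from_landline = sum(w * y for w, y, only in landline if not only)
    mixed_from_cell = sum(w * y for w, y, only in cell if not only)
    y_ab = lam * mixed_from_landline + (1.0 - lam) * mixed_from_cell
    return y_a + y_ab + y_b


# Hypothetical weighted records.
landline_sample = [(10.0, 2.0, True), (10.0, 4.0, False)]
cell_sample = [(20.0, 1.0, True), (20.0, 3.0, False)]
estimate_half = overlap_estimate(landline_sample, cell_sample, 0.5)
estimate_landline = overlap_estimate(landline_sample, cell_sample, 1.0)
```

Setting `lam` to 1 or 0 uses only one sample for the mixed domain, which is one way to examine how sensitive the total is to differential nonresponse between the two samples.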


5.3 Variance estimation 


To make inferences from the sample to the overall 
population, we require an estimator of the variance of the 
estimated total. First, consider the case of nonoverlapping 
domains. By working in the SU population, we can employ 
methods of variance estimation appropriate to the survey 
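design. From (7), the estimated total for the cell-phone-only
domain may be written as

$$\hat{Y}^{b} = \sum_{i \in s^{SB}} W_i^{SB} X_i,$$

where

$$X_i = \sum_{j \in U^{EB}} \frac{l_{ij}\,\delta_j Y_j}{\sum_{i' \in U^{SB}} l_{i'j}}. \tag{13}$$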


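Assuming simple random sampling within strata $h = 1, \ldots, H$,
the unbiased estimator of the variance of the estimated total
is given by

$$v(\hat{Y}^{b}) = \sum_{h=1}^{H} \frac{N_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right) s_{xh}^2,$$

where

$$s_{xh}^2 = \frac{1}{n_h - 1} \sum_{i \in s_h^{SB}} (X_i - \bar{x}_h)^2, \qquad \bar{x}_h = \frac{1}{n_h} \sum_{i \in s_h^{SB}} X_i,$$

and $N_h$ and $n_h$ are the numbers of SUs in the population and
in the sample in stratum $h$. If we ignore the finite population
correction factor, which is possible in almost any real
telephone survey, the variance estimator becomes

$$v(\hat{Y}^{b}) = \sum_{h=1}^{H} \frac{N_h^2}{n_h} s_{xh}^2. \tag{14}$$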


Now let $v(\hat{Y}^{A})$ be an estimator of the variance of $\hat{Y}^{A}$
for the RDD sample of landlines. Such estimators are well
known and we do not review them here; see, for example,
Wolter et al. (2008). Because sampling is independent in the
landline and cell-phone sampling frames, the unbiased
estimator of the variance of the estimated total for the entire
telephone population becomes

$$v(\hat{Y}^{T}) = v(\hat{Y}^{A}) + v(\hat{Y}^{b}). \tag{15}$$

To facilitate the following developments, we let $V^{SB}[\delta Y]$
be another symbol to represent the estimator of variance in
(14). This notation emphasizes the fact that the estimator
of variance is based on the $X_i$ variable in (13) defined in
terms of the characteristic $\delta_j Y_j$, which is the characteristic of
interest for cell-phone-only EUs. Also, let the symbol
$V^{SA}[Y]$ be the estimator $v(\hat{Y}^{A})$ defined in terms of the
characteristic $Y_j$. With this notation, (15) becomes
$v(\hat{Y}^{T}) = V^{SA}[Y] + V^{SB}[\delta Y]$.

Second, consider variance estimation for the case of
overlapping domains. The estimator of the total of the
telephone population is now $\hat{Y}^{T}$ in (11). For fixed $\lambda$, the
unbiased estimator of variance follows from the work
done in (14) and (15). It is

$$v(\hat{Y}^{T}) = V^{SA}[\phi Y + \lambda(1-\phi)Y] + V^{SB}[\delta Y + (1-\lambda)(1-\delta)Y]. \tag{16}$$

The first term on the right side of (16) is the variance
estimator for the RDD sample of landlines applied to the
composite characteristic $\phi_j Y_j + \lambda(1-\phi_j)Y_j$, which is the
characteristic for landline-only EUs plus a $\lambda$-portion of the
characteristic for mixed EUs. The second term on the right
side of (16) is the variance estimator for the cell-phone
sample applied to the composite characteristic
$\delta_j Y_j + (1-\lambda)(1-\delta_j)Y_j$, which is the characteristic for
cell-phone-only EUs plus a $(1-\lambda)$-portion of the characteristic
for mixed EUs.

Estimators of covariance matrices can be built up from 
expressions like (15) and (16), facilitating statistical infer- 
ence concerning other population parameters of interest. 
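
The following Python sketch illustrates the kind of calculation involved, under stated assumptions: SUs are drawn by simple random sampling within strata, the finite population correction is ignored, and the two frames are sampled independently so that their variance estimates add. The data layout and function name are hypothetical, and the landline variance is computed with the same simple formula purely for illustration.

```python
from statistics import variance


def stratified_variance(strata):
    """Variance estimate of a weighted total, ignoring the fpc.

    strata: list of (N_h, xs), where N_h is the number of SUs in the
        stratum population and xs is the list of SU-level values
        observed in the stratum sample (at least two per stratum).
    """
    total = 0.0
    for n_pop, xs in strata:
        n_h = len(xs)
        if n_h < 2:
            raise ValueError("need at least two sampled SUs per stratum")
        # statistics.variance uses the n - 1 divisor.
        total += n_pop ** 2 * variance(xs) / n_h
    return total


# Hypothetical SU-level values for each frame.
cell_strata = [(100, [1.0, 3.0]), (50, [2.0, 2.0, 2.0])]
landline_strata = [(200, [0.0, 1.0, 2.0])]
v_cell = stratified_variance(cell_strata)
v_landline = stratified_variance(landline_strata)
# Independent samples: the variance of the sum is the sum of variances.
v_total = v_landline + v_cell
```

For the overlapping-domain case, the same function would be applied to SU-level values built from the composite characteristics rather than from the single-domain characteristic.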


5.4 Adjustments of the sampling weights 


The sampling weights may be adjusted because of non- 
response or a planned calibration to known control totals. 

Thus far, we have not addressed the various types of 
missing data that may occur in a cell-phone survey. We will 
focus on deriving adjustments for missing data that arise 
during the cell-phone interviews, assuming that standard
adjustments for missingness in the landline sample have
already been incorporated in the $\{W_j^{EA}\}$ weights.

Missing data can arise due to three factors: (i) non- 
resolution of the SU; (ii) an incomplete screening interview 
of the RU; and (iii) an incomplete main interview of the RU. 
In this article, we adopt the convention that the resolution 
step refers to the classification of the SU as an ACPN or 
something else, such as a disconnected line or a dedicated 
business line; nonresolved SUs and SUs resolved as
non-ACPNs do not continue with the interview. The
screening step refers to a brief preliminary interview
intended to ascertain telephone status and to determine any 
demographic or other eligibility characteristics of any EUs 
linked to the RU; RUs for which the screening interview is 
incomplete or for which the screening interview is complete 
but no eligible EUs are linked to the RU do not continue 
with the interview. If the survey protocol calls for including 
only cell-phone-only EUs, as in Section 5.1, then the 
interview would terminate at this point for any mixed EUs. 
On the other hand, if the survey protocol calls for including 
both cell-phone-only and mixed EUs, as in Section 5.2, then 
the interview would continue for all such EUs. The 
interview step refers to the collection of the main survey 
items that form the substance of the survey for each of the 
eligible EUs linked to the RU. The survey methodologist 
must institute a definition of what constitutes a completed 
interview. In particular, the methodologist must decide 
whether breakoffs (an interview attempt that is completed 
for some but not all of the eligible EUs linked to the RU) are 
to be treated as a completed interview or not. Some other 
authors may organize the steps in the survey response 
process somewhat differently than the convention adopted 
here. 

Adjustments to the sampling weights can be made for
nonresolution and screener nonresponse, assuming a
missing-at-random model for the response mechanism.
These two adjustments must be made at the SU level. Let
$\{s_{\alpha}^{SB}\}$ be a partition of the cell-phone sample into
user-specified weighting cells $\alpha$, and let the base sampling
weights from (6) now be denoted by $W_{1i}^{SB}$, where the
subscript 1 has been added simply to signify the first step in
a multi-step adjustment process. Telephone area codes, rate
centers, and census environmental variables at the county or
area code level can be used to form the weighting cells;
otherwise, little covariate information is available
concerning cell-phone numbers. The cell-specific resolution
completion rates are defined by

$$R_{1\alpha} = \frac{\sum_{i \in s_{\alpha}^{SB}} r_{1i} W_{1i}^{SB}}{\sum_{i \in s_{\alpha}^{SB}} W_{1i}^{SB}},$$

where $r_{1i}$ is a resolution indicator variable (= 1, if resolved,
= 0, if not resolved), and the nonresolution adjusted weights
are $W_{2i}^{SB} = r_{1i} W_{1i}^{SB} / R_{1\alpha}$ for $i \in s_{\alpha}^{SB}$.

Let $e_{2i}$ be an indicator of whether $i$ is a resolved ACPN
(= 1, if resolved ACPN, = 0, otherwise), and let $\{s_{\beta}^{SB}\}$ be
a partition of the cell-phone sample into user-specified
weighting cells, which could be the same as or different from
the foregoing partition. Then, the cell-specific screener
completion rates are

$$R_{2\beta} = \frac{\sum_{i \in s_{\beta}^{SB}} r_{2i} e_{2i} W_{2i}^{SB}}{\sum_{i \in s_{\beta}^{SB}} e_{2i} W_{2i}^{SB}},$$

where $r_{2i}$ is a screener indicator variable (= 1, if screener
completed, = 0, if screener not completed), and the screener-
nonresponse adjusted weights are $W_{3i}^{SB} = r_{2i} e_{2i} W_{2i}^{SB} / R_{2\beta}$
for $i \in s_{\beta}^{SB}$. Note that the appropriate sum of the weights is
preserved at each step of the adjustment process.
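
A single weighting-class step of the kind described above can be sketched in Python as follows. This is only an illustrative sketch: the function name and list-based layout are assumptions, and the completion rate is taken to be the weighted share of eligible units in a cell that completed the step.

```python
def weighting_class_adjust(weights, completed, cells, eligible=None):
    """Apply one weighting-class nonresponse adjustment.

    weights:   current weight of each sampled unit
    completed: 1 if the unit completed this step, else 0
    cells:     weighting-cell label of each unit
    eligible:  1 if the unit is in scope for this step, else 0
               (defaults to every unit being in scope)
    """
    if eligible is None:
        eligible = [1] * len(weights)
    numer, denom = {}, {}
    for w, r, e, c in zip(weights, completed, eligible, cells):
        numer[c] = numer.get(c, 0.0) + w * r * e
        denom[c] = denom.get(c, 0.0) + w * e
    adjusted = []
    for w, r, e, c in zip(weights, completed, eligible, cells):
        if r * e == 0:
            # Nonrespondents and out-of-scope units get zero weight.
            adjusted.append(0.0)
            continue
        rate = numer[c] / denom[c]  # weighted completion rate in cell c
        adjusted.append(w / rate)
    return adjusted


# Hypothetical sample: one nonrespondent in cell "a".
base = [10.0, 10.0, 20.0, 20.0]
new = weighting_class_adjust(base, [1, 0, 1, 1], ["a", "a", "b", "b"])
```

Within each cell the respondents absorb the weight of the nonrespondents, so the weighted total over in-scope units is unchanged; a cell with in-scope units but no respondents would need to be collapsed with another cell before this step.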

Next, an adjustment to the sampling weights must be
made for interview nonresponse. Depending on how breakoffs
are classified by the survey methodologist, there may
be two cases to consider: (i) the RU completes or fails to
complete the interview for all of its linked and eligible EUs
en masse, or (ii) the RU selectively completes or fails to
complete the interview on an EU by EU basis. If breakoffs
are classified as incomplete interviews, then only
Case i applies. Let $e_{3i}$ be an indicator of whether the
RU is screened and is linked to at least one EU that is
eligible for the interview (= 1, if screened and eligible, = 0,
otherwise), and let $r_{3i}$ be the interview indicator variable
(= 1, if the interview is complete, = 0, otherwise).

For Case i, the weight adjustment can be made at the SU
level and is given by $W_{4i}^{SB} = r_{3i} e_{3i} W_{3i}^{SB} / R_{3\gamma}$ for $i \in s_{\gamma}^{SB}$,
where $R_{3\gamma}$ is the weighted interview completion rate
computed within user-specified weighting cells $\gamma$. Again,
options for constructing weighting cells are limited in a cell-
phone survey; they may be specified in terms of the
information available at the previous weighting steps or any
information collected in the screening interview. The
weighted interview completion rate is

$$R_{3\gamma} = \frac{\sum_{i \in s_{\gamma}^{SB}} r_{3i} e_{3i} W_{3i}^{SB}}{\sum_{i \in s_{\gamma}^{SB}} e_{3i} W_{3i}^{SB}}.$$

The estimated total for the cell-phone-only domain may
now be expressed by

$$\hat{Y}^{b} = \sum_{j \in s^{EB}} W_{4j}^{EB} \delta_j Y_j, \tag{17}$$

where

$$W_{4j}^{EB} = \sum_{i \in s^{SB}} W_{4i}^{SB} \frac{l_{ij}}{\sum_{i' \in U^{SB}} l_{i'j}}$$

and $s^{EB}$ is the set of eligible EUs reported in the screening
interviews. The weight is zero for any eligible EUs in $s^{EB}$
for which the RU failed to complete the main interview. The
estimated total for the mixed domain, if called for by the
survey protocol, is defined similarly by

$$\sum_{j \in s^{EB}} W_{4j}^{EB} (1-\delta_j) Y_j.$$

For Case ii, the noninterview adjustment must be made at
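the EU level. The EUs are treated as spawned cases and a
decision is made for each one as to whether it has a
completed interview or not. The estimated total for the cell-
phone-only domain is (17), where the weight is now defined
by

$$W_{4j}^{EB} = r_{3j} W_{3j}^{EB} / R_{3\gamma} \quad \text{for } j \in s_{\gamma}^{EB},$$

$$W_{3j}^{EB} = \sum_{i \in s^{SB}} W_{3i}^{SB} \frac{l_{ij}}{\sum_{i' \in U^{SB}} l_{i'j}},$$

and

$$R_{3\gamma} = \frac{\sum_{j \in s_{\gamma}^{EB}} r_{3j} W_{3j}^{EB}}{\sum_{j \in s_{\gamma}^{EB}} W_{3j}^{EB}}.$$

Here, $r_{3j}$ indicates whether the interview for EU $j$ is
complete, and the weighting cells, $\gamma$, are defined in terms of
characteristics of the EUs as determined from the screening
interview and other sources.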

For either Case i or ii, to facilitate computations, take
$W_j^{EA}$ to be defined and equal to zero for EUs in the cell-
phone sample, and take $W_{4j}^{EB}$ to be equal to zero for EUs in
the landline sample. If the survey protocol is as in Section
5.1, then we conclude that the survey weights for estimating
the population total of interest are defined by

$$W_j = W_j^{EA} + W_{4j}^{EB} \delta_j \tag{18}$$

for $j \in s^{ET}$, where $s^{ET} = s^{EA} \cup s^{EB}$. Otherwise, if the
survey protocol is as in Section 5.2, then we conclude that
the survey weights are defined by

$$W_j = W_j^{EA}\{\phi_j + \lambda(1-\phi_j)\} + W_{4j}^{EB}\{\delta_j + (1-\lambda)(1-\delta_j)\} \tag{19}$$

for $j \in s^{ET}$.

The nonresponse-adjusted weights from (18) or (19) may 
be calibrated (Deville and Särndal 1992) to external control
totals within socio-economic or geographic cells for the 
population of EUs, using poststratification, raking, or GREG 
(generalized regression estimation) techniques. If accurate 
sources are available, control totals may be established and 
calibration may be conducted separately for domains A and 
b or for domains a, ab, and b. If control totals are not 
available by telephone status, then calibration must use 
control totals for the entire population regardless of 
telephone status.

To illustrate these ideas, we briefly examine the GREG
estimator. Let us suppose that we have available a $1 \times p$
auxiliary variable $Z_j$ for the observed, eligible EUs for
which the control totals $Z^{T} = \sum_{j \in U^{ET}} Z_j$ are known. For
example, the z-variable may arise from a fully saturated
model in terms of explanatory variables age, race, and sex.
Let $s_r^{ET}$ be the set of EUs with a completed main interview
and let $m = \#(s_r^{ET})$ be the number of eligible EUs
reported in the completed interviews obtained within the
consolidated telephone sample. Stack the y-values, z-values,
and weights into the matrices $\mathbf{Y} = (Y_1, \ldots, Y_m)'$,
$\mathbf{Z} = (Z_1', \ldots, Z_m')'$, and $\mathbf{W} = \mathrm{diag}(W_1, \ldots, W_m)$. Then the GREG
estimator (Cassel, Särndal and Wretman 1976) of the total of
the telephone population of interest takes the familiar form

$$\hat{Y}_G^{T} = \hat{Y}^{T} + (Z^{T} - \hat{Z}^{T})\hat{\beta} = \sum_{j \in s_r^{ET}} W_j g_j Y_j,$$

where the estimated coefficients are given by
$\hat{\beta} = (\mathbf{Z}'\mathbf{W}\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{W}\mathbf{Y}$, $\hat{Y}^{T} = \sum_{j \in s_r^{ET}} W_j Y_j$, $\hat{Z}^{T} = \sum_{j \in s_r^{ET}} W_j Z_j$,
and $g_j = 1 + (Z^{T} - \hat{Z}^{T})(\mathbf{Z}'\mathbf{W}\mathbf{Z})^{-1} Z_j'$. Lavallée (2007, Chapter 7)
derives the Taylor series estimator of the variance of the
GREG estimator in an indirect sampling context. Also see
Wolter (2007, Chapter 6) for estimation of the variance of
the GREG estimator.
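
When the auxiliary variable comes from a fully saturated model, as in the age, race, and sex example, the GREG estimator reduces to ordinary poststratification: every weight in a cell is scaled by the ratio of the cell's control total to its weighted sample count. The Python sketch below shows only that special case; the function name and data are hypothetical.

```python
def poststratify(weights, cells, control_totals):
    """Scale weights so each cell's weighted count hits its control.

    weights:        nonresponse-adjusted weight of each responding EU
    cells:          poststratification cell label of each EU
    control_totals: {cell label: known population count}
    """
    estimated = {}
    for w, c in zip(weights, cells):
        estimated[c] = estimated.get(c, 0.0) + w
    # The ratio control / estimated plays the role of the g-factor.
    return [w * control_totals[c] / estimated[c]
            for w, c in zip(weights, cells)]


# Hypothetical example with two cells.
w_in = [10.0, 10.0, 30.0]
cell_labels = ["m", "m", "f"]
calibrated = poststratify(w_in, cell_labels, {"m": 30.0, "f": 30.0})
```

A general GREG implementation with continuous auxiliary variables would instead solve the weighted least-squares system for the coefficients; that is not shown here.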

Before leaving the topic of calibration, we note that we 
have largely left aside the small phoneless population, 
which fundamentally is impossible to sample in a telephone 
survey. Yet, in all likelihood, the overall population total
$Y = Y^{T} + Y^{0}$ will be the parameter of interest, not the
total of the telephone population $Y^{T}$, and the known
control totals used in calibration may be totals for the
overall population $Z = Z^{T} + Z^{0}$, not totals for the
telephone population $Z^{T}$. To include the phoneless
population, we may consider use of a revised GREG
estimator with $g_j = 1 + (Z - \hat{Z}^{T})(\mathbf{Z}'\mathbf{W}\mathbf{Z})^{-1} Z_j'$. This revision takes
the same model for the phoneless population as for the 
telephone population. See Keeter (1995) and Chowdhury, 
Montgomery and Smith (2008) for other considerations in 
the calibration of weights for the phoneless population. 


5.5 Identifiability assumptions 


The foregoing theory assumes fundamentally that if SU $i$
is selected into the sample of cell-phone lines, then $X_i$
defined in (2) is observable in the cell-phone interview. Yet
the 9th network (and also the 8th) in Figure 1 illustrates a
potential problem for the theory. For this network, two RUs 
are linked to one SU, and in turn each RU is linked to only 
one EU. To continue this illustration, we suppose that these 
two EUs are not linked to any other RUs in the population. 
At the time of the survey interview, only one of the RUs 
will typically be reached and interviewed (unless the survey 
protocol would specifically mandate that an interview be
attempted with each RU linked to the selected SU). The
respondent RU will report for its linked EU, but by the very 
nature of this network, the respondent cannot report for the 
EU that is linked to the companion RU who shares the 
sample cell-phone line. Thus, there is at least one EU that is 
linked to the SU that cannot be observed, i.e., data cannot be 
collected in the cell-phone interview. Thus, we say $X_i$ is
not identifiable. The situation regarding the reportability of
the two EUs would be reversed if the cell-phone interview 
attempt would have rung through to the companion RU. 

To maintain the unbiasedness of the estimator of the 
population total, $X_i$ must be identifiable for every
respondent SU selected into the sample of cell-phone lines. 
We need to make one of two assumptions. First, we could 
assume the problem away by acting as if networks like 
numbers 8 and 9 either do not exist or are trivial in number. 

Secondly, the more realistic case would be to assume an 
extra randomization step, namely, that the interview call 
attempt to the given SU has reached a randomly selected 
RU linked to the SU. This randomization could be viewed 
as conceptual (that is, occurring naturally and not directed 
by the survey methodologist). To be formal and rigorous, 
one would need to collect information on the number of 
RUs linked to the SU and the probability that the cell-phone 
call attempt would ring through to the respondent RU. The 
probability would be approximated by the respondent’s self- 
report of his or her share of use of the cell phone. If only one 
RU is linked to the SU, then this probability is 1.0 and 
clearly this simple value would not need to be collected in 
the interview once it is reported that there is only one RU. If 
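two or more RUs are linked to the SU, then the probability
or share to be collected is denoted by $\tau_{ik}$ for RUs indexed
by $k$, where $\sum_{k \in U_i^{RB}} \tau_{ik} = 1$ and $U_i^{RB}$ is the set of RUs that
are linked to the $i$th SU. With this additional information in
hand, an unbiased estimator of

$$X_i = \sum_{j \in U^{EB}} \frac{l_{ij}\,\delta_j Y_j}{\sum_{i' \in U^{SB}} l_{i'j}}$$

is given by

$$\hat{X}_i = \sum_{k \in U_i^{RB}} \frac{\alpha_{ik}}{\tau_{ik}} \sum_{j \in U^{EB}} \frac{l_{ikj}}{\sum_{k' \in U_i^{RB}} l_{ik'j}} \, \frac{\delta_j Y_j}{\sum_{i' \in U^{SB}} l_{i'j}}, \tag{20}$$

where $\alpha_{ik}$ is an indicator variable signifying whether the
$k$th RU was the realized respondent or not for the $i$th SU in
$s^{SB}$ and

$$l_{ikj} = \begin{cases} 1, & \text{if SU } i \text{ is linked to RU } k \text{ which in turn is linked to EU } j \\ 0, & \text{otherwise.} \end{cases}$$

The data are now identified and one can plug (20) into
(7), giving the revised estimator

$$\hat{Y}^{b} = \sum_{j \in s^{EB}} \delta_j Y_j W_j^{EB}, \tag{21}$$

with revised weights

$$W_j^{EB} = \sum_{i \in s^{SB}} W_i^{SB} \sum_{k \in U_i^{RB}} \frac{\alpha_{ik}}{\tau_{ik}} \, \frac{l_{ikj}}{\sum_{k' \in U_i^{RB}} l_{ik'j}} \, \frac{1}{\sum_{i' \in U^{SB}} l_{i'j}}. \tag{22}$$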


As an approximation, one could take the RUs to be equal
users of the cell phone, in which case $\tau_{ik}$ would simply be
the reciprocal of the number of RUs linked to the SU $i$ for
all RUs $k$. Adjustments for nonresponse and calibration to
control totals would proceed as before.
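
The reason that dividing by the share restores unbiasedness can be checked numerically. The Python sketch below assumes the simple network discussed in this section: one sampled cell phone shared by two RUs, each linked to a single EU that is linked to no other SU. The names and numbers are hypothetical.

```python
def share_adjusted_value(respondent, shares, reported):
    """Value recorded for the SU when a given RU answers the call.

    respondent: id of the RU that was actually reached
    shares:     {ru_id: probability that a call reaches that RU}
    reported:   {ru_id: total y-value of the EUs that RU reports for}
    """
    return reported[respondent] / shares[respondent]


def expected_value(shares, reported):
    """Average of the recorded value over which RU answers the call."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to one")
    return sum(tau * share_adjusted_value(k, shares, reported)
               for k, tau in shares.items())


# Two RUs share the phone 70/30; their children have y = 5 and y = 8.
tau = {"ru1": 0.7, "ru2": 0.3}
y_by_ru = {"ru1": 5.0, "ru2": 8.0}
mean_value = expected_value(tau, y_by_ru)
equal_share_mean = expected_value({"ru1": 0.5, "ru2": 0.5}, y_by_ru)
```

Whichever RU answers, only one child is observed, but averaging over who answers recovers the full SU-level total of 13 in this example. The equal-share approximation is unbiased in the same sense only if the call really does reach each RU with equal probability.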

Alternatively, the survey methodologist could call for a 
real randomization step, which would require that the 
interviewer make a roster of the RUs linked to the SU and 
select one at random, or a pseudo randomization step using 
the last birthday method. Such methods are probably not 
feasible at this time, due to the difficulty of gaining cooper- 
ation in cell-phone interviews. 


5.6 Implications for data collection 


Certain information must be collected in the survey 
interview in order to support the calculation of the esti- 
mators discussed here. 

To support the use of $\delta_j$, the cell-phone survey must
collect information to establish whether any of the RUs 
linked to the EU have access to a landline telephone. The 
respondent RU must report this information both for himself 
or herself and for other RUs that may be linked to the EU. 

To support the use of $\phi_j$, the landline survey must
collect information to establish whether any of the RUs 
linked to the EU have regular access to a cell phone. The 
respondent RU must report this information both for himself 
or herself and for other RUs that may be linked to the EU. 
This report may be quite straightforward in the event that 
the response protocol only links EUs to RUs within the 
same household. For more complicated response protocols, 
the report could be difficult to obtain. 

To support the use of $\sum_{i' \in U^{SB}} l_{i'j}$ in calculating the
survey weights, the survey must collect information to
establish how many SUs in the population are linked to the
reported EU $j$. The respondent RU must be able to report
the number of cell phones, including their own, that ring to 
an RU who is linked to the given EU. 

If the estimator given in (21) and (22) would be used in 
order to identify all of the EUs, then additional information 
must be collected in the interview. The respondent RU must 
know and report the number of RUs, including themselves, 
that are linked to both the selected SU and the reported EU.
The respondent RU must also know and report their share of
use of the cell phone on which the interview is completed or 
be able to say that use is approximately equal. 


6. Example: The National Immunization 
Survey (NIS) 


We illustrate the information that must be collected in the 
survey interview using the NIS, a survey of parents of 
children age 19-35 months and of teens age 13-17 years 
sponsored by the Centers for Disease Control and Preven- 
tion (CDC) for the purpose of monitoring vaccination 
coverage rates (i.e., the proportion of children who are up- 
to-date with respect to the recommended vaccination 
schedule) in the USA. Data collection in the NIS occurs in 
two phases: an RDD telephone survey of households with 
landline telephones that have children or teens in the eligible 
age range, followed by a survey mailed to the vaccination 
providers of the age-eligible children. The sampling frame 
for the telephone survey phase of the NIS consists of all 
landline telephone numbers in 1+ banks in the USA. 
Cellular telephone numbers in dedicated cellular banks are 
currently not included in the NIS sampling frame. When a 
household with an age-eligible child is identified in the 
telephone survey, the interview is conducted with the adult 
in the household who is identified as the most knowledge- 
able about the vaccination status of the child (nearly always 
the mother or father). During the telephone interview, data 
are collected for each age-eligible child in the household, 
including the demographic characteristics of the child, 
demographic characteristics of the child’s mother, and 
socio-economic characteristics of the child’s household. At 
the end of the telephone interview, consent is asked to 
contact the child’s vaccination providers. If consent is given, 
all vaccination providers named by the telephone interview 
respondent are contacted by mail to obtain the child’s 
provider-reported vaccination history, which is used in 
statistical analysis to evaluate vaccination status. Smith, 
Hoaglin, Battaglia, Khare and Barker (2005) provide a 
detailed description of the statistical methods used by the 
NIS. 

Because of the growth of the cell-phone-only population, 
the proportion of the NIS target population that is covered 
by the landline sampling frame has decreased in recent 
years. Using data from the National Health Interview 
Survey, Khare, Singleton, Wouhib and Jain (2008) estimate 
that about 18 percent of eligible children and 10 percent of 
eligible teens may be missing from the NIS sampling frame. 
To address the increase in cell-phone-only households in the 
NIS target population, cell-phone interviews could be added 
to the NIS.

For the NIS, the telephone number is the SU, the
knowledgeable mother or father is the RU, and the age- 
eligible child is the EU. For the landline RDD or A sample, 
the parent is a resident of the household to which the sample 
landline number is assigned, while for the cell-phone or B 
sample, the parent has regular access to the cell phone to 
which the sample telephone number is assigned. Children 
are not subsampled in the NIS, but rather the knowledgeable 
parent reports for all of their age-eligible children who live 
in their home (but not for any children who may live 
elsewhere). These elements of the survey protocol establish 
the links between RUs and SUs and between EUs and RUs. 

One comprehensive NIS design is to conduct estimation 
by way of nonoverlapping domains and decomposition (3). 
That is, the A sample is used to represent all children linked 
to a landline household and the B sample is used to 
represent all children linked to a cell-phone-only parent. We 
considered and rejected decompositions (4) and (5) due to 
considerations of cost and the potential for differential 
nonresponse bias in estimation for the mixed population. 

To implement the estimator in (10), we determine 
whether the A-sample child is landline-only through use of 
the following three questionnaire items: 


A1. Next I have some questions about cell phones in
your household. In total, how many working cell 
phones do you and your household members have 
available for personal use? Please don’t count cell 
phones that are used exclusively for business 
purposes. 

A2. How many [of these] cell phones do [LIST ALL
ELIGIBLE CHILDREN]’s parents and guardians 
usually use? 

A3. Of all the telephone calls that you and your family
receive, are nearly all received on cell phones, 
nearly all received on regular phones, or some 
received on cell phones and some received on 
regular phones? (IF ASKED ABOUT INCLUDING 
BUSINESS CALLS: Please do not include any 
business-related calls in your answer). 


For the cell-phone or B sample, we establish whether the 
child is cell-phone-only using the following two questions. 


B1. Do you have a landline in your household?
(INTERVIEWER PROBE IF YES: Please do not 
include modem only lines, fax only lines, lines used 
just for a home security system, beepers, pagers, or 
the cell phone). 

B2. Thinking just about the landline home phone, not 
your cell phone, if that telephone rang and someone 
was home, under normal circumstances how likely 
is it that it would be answered? Would you say
extremely likely, somewhat likely, somewhat unlikely,
or not at all likely?


We would use Question B2, due to Cantor, Brownlee, 
Zukin and Boyle (2008), to determine whether the landline 
is actually used for voice communications and thus whether 
the respondent is in the ab or b domain. 

Also for the B sample, to determine the number of cell 
phones in the population that are linked to a given age- 
eligible child, we would use the following two questions: 


B3. Next, I have some questions about cell phones in 
your household. In total, how many working cell 
phones do you and your household members have 
available for personal use? Please do not count cell 
phones that are used exclusively for business 
purposes, and please include the number we called. 

B4. How many of these cell phones do [LIST
CHILDREN]’s parents and guardians usually use? 
Please include the number we called. 


Responses to questions A1-A3 and B1-B4 permit the
calculation of survey weights and implementation of the 
unbiased estimator of the population total given in (10). 


7. Summary 


In this article, we used some theory of indirect sampling 
and network sampling to demonstrate a statistical frame- 
work for the design and analysis of cell-phone surveys. We 
exhibited an unbiased estimator of the population total with 
respect to estimation units linked to sampling units. By 
implication, this theory gives a means of constructing 
estimators of other population parameters that can be 
expressed as functions of totals. We illustrated the issues 
using the NIS, a telephone survey about young children and 
teens. 

Information from the survey interviews is needed to 
classify estimation units into the cell-phone-only domain, 
the landline-only domain, or the mixed domain. Reporting 
error could result in misclassifications and undermine the 
unbiasedness of the estimator, as could survey nonresponse 
in the cell-phone and landline interviews.


Collecting data for poverty and vulnerability
assessment in remote areas in Sub-Saharan Africa 


Rudolf Witt, Diemuth E. Pemsl and Hermann Waibel ' 


Abstract 


Data collection for poverty assessments in Africa is time consuming, expensive and can be subject to numerous constraints. 
In this paper we present a procedure to collect data from poor households involved in small-scale inland fisheries as well as 
agricultural activities. A sampling scheme has been developed that captures the heterogeneity in ecological conditions and 
the seasonality of livelihood options. Sampling includes a three point panel survey of 300 households. The respondents 
belong to four different ethnic groups randomly chosen from three strata, each representing a different ecological zone. In 
the first part of the paper some background information is given on the objectives of the research, the study site and survey 
design, which were guiding the data collection process. The second part of the paper discusses the typical constraints that 
are hampering empirical work in Sub-Saharan Africa, and shows how different challenges have been resolved. These 
lessons could guide researchers in designing appropriate socio-economic surveys in comparable settings. 


Key Words: Socio-economic household surveys; Survey design; Data collection challenges; Sub-Saharan Africa.


1. Introduction 


To collect economic data in small-scale fisheries in Sub- 
Saharan Africa (SSA) is challenging, as patterns and 
constraints of resource use vary considerably, i.e., spatially, 
seasonally and over time. This requires careful planning of 
the collection of data that is needed for meaningful poverty 
and vulnerability assessment. Although small-scale fisheries 
(SSF) can generate significant profits and make consid- 
erable contributions to poverty alleviation and food security, 
little information exists about their actual contribution to 
livelihoods and household economics in Sub-Saharan Africa 
(FAO 2005, 2006). The key constraints for empirical studies 
in this field are difficulties associated with data collection, 
such as remoteness and inaccessibility especially during the 
rainy season. High variability of natural resource conditions,
and thus of production, imposes additional requirements on
survey design. For preparation and implementation of a
survey in SSA, researchers can draw upon similar studies in 
other parts of the world concerning survey methodology, 
questionnaire design, and interview procedure, e.g., the 
World Bank’s Living Standard Measurement Survey 
(LSMS) questionnaire. However, many peculiarities of rural 
communities in SSA require an adapted and elaborated 
approach. 

Some of these peculiarities are of an ecological nature, 
such as seasonal changes in access to resources and markets, 
which are directly affecting patterns and constraints of 
resource use. Others pertain to the economic side of 
household behavior, since income-generating activities of 
rural households in SSA compose complex portfolios. 


Particularly households in fishery-dependent communities 
have adopted a flexible and strongly seasonal matrix of 
diversified activities (Béné, Neiland, Jolley, Ovie, Sule, 
Ladu, Mindjimba, Belal, Tiotsop, Baba, Dara, Zakara and 
Quensiere 2003a; Béné, Neiland, Jolley, Ladu, Ovie, Sule, 
Baba, Belal, Mindjimba, Tiotsop, Dara, Zakara and 
Quensiere 2003b; Béné, Mindjimba, Belal, Jolley and
Neiland 2003c; Neiland, Jaffry and Kudasi 2000; Neiland,
Madaka and Béné 2005; Sarch 1997). The local populations 
are alternatively or simultaneously fishers, herders, and 
farmers, and each piece of land is potentially a fishing 
ground, a grazing area and a cultivated field, depending on 
the flood cycle (Béné et al. 2003a, page 20). Due to high
vulnerability of the ecological and economic system to 
shocks, such as flood, drought and pest outbreaks which 
result in year to year variation in fish stocks and in high crop 
losses, households have diversified their activities portfolio, 
thus spreading the risk of income losses. Capturing the 
dynamic interplay of the different livelihood elements is a 
special challenge in conducting socio-economic household 
surveys. Other constraints for data collection are culturally 
determined, for example tensions between different ethnic 
groups, the existence of a multitude of languages and patois 
spoken in the study region, or some peculiarities of the 
Muslim-African culture.

The data required for poverty and vulnerability assess- 
ment demand an appropriate survey methodology, for data 
quality to meet the requirements of a robust econometric 
analysis. Data needs for economic poverty assessment and 
the evaluation of SSF’s contribution to poverty and vulner- 
ability alleviation are substantial. Detailed information on
household income, including different income sources such
as agricultural production, fishing, livestock rearing, off-
farm work etc., is necessary. Also, data on the stock and
value of productive and convertible assets, as well as on the 
distribution of consumption expenditures need to be elicited. 
In addition, information on control variables, such as 
ecological, economic or social shocks that have occurred in 
the past, subjective risk assessments, debts and liabilities, 
household composition, and others, is required. 

This paper presents the collection procedure of quanti- 
tative household data from poor households in the Logone 
floodplain, a major inland fisheries region in Northern 
Cameroon. The objective of collecting household level 
panel data in 2007-2008 was to assess the role of small- 
scale fisheries (SSF) in mitigating risk through portfolio 
diversification, thus contributing to reducing vulnerability to 
poverty. In this paper, we emphasize the requirements of the 
general methodological approach for sampling and survey 
design. Due to the complex nature of the SSF sector 
outlined above, a procedure for sampling and data collection 
is required that allows the assessment of poverty and 
vulnerability of SSF households. Particularly, the survey 
design needs to account for the high variation in income 
generating activities over time as a result of the high 
variability of access to natural resources and resulting 
adjustments in a household’s food security situation, con- 
sumption, income and assets. 


2. Study site and sampling procedure 


The study site is the Logone floodplain in the Far-North 
province of Cameroon. The floodplain covers about 8,000 
km² and is part of the bigger Logone-Chari subsystem in the 
Lake Chad Basin, which supplies 95% of Lake Chad's total 
riverine inputs and has a basin area of approximately 
650,000 km² (UNEP 2004). Within this vast area a representative 
region was defined in collaboration with national 
experts and other key informants, while considering the 
accessibility and logistic feasibility of the study. The study 
area covers about 2,400 km², spreading from the Maga Lake 
in the south to Ivyé village in the north, where the Logomatya 
joins the Logone River. This area is relatively densely 
populated and is characterized by rich fish stocks and 
intensive fishing, fish processing and fish trading. 

The livelihoods of the rural population in this area are 
particularly exposed to harsh climatic conditions, such as 
limited and erratic rainfall, which result in a large variation 
of production outcomes from year to year and thus 
considerable income risk. In this respect, the study area is 
representative of many similar rural settings, particularly in 
the Sudano-Sahelian zone of Sub-Saharan Africa. However, 
the impact differs between the sub-regions of 
the study area. Based on Neyman (1938), as cited in Rao 
(2005), a stratified random sampling procedure was there- 
fore considered most effective. To draw a representative 
sample of households in the study area while accounting for 
different production conditions (such as access to fish 
resources), a stratification of the study site into different 
agroecological zones was undertaken. It was assumed that 
under different ecological and production conditions the role 
of fisheries in terms of income generation would differ. This 
procedure allowed capturing the whole continuum of fishing 
intensity (from specialized/full-time fishermen to purely 
agriculture/livestock rearing oriented households). 

In a second step, a complete list of villages in the study 
area (N = 88) was compiled. These villages served as the 
primary sampling unit. Following the recommendations of 
local fisheries experts, 14 villages were selected propor- 
tional to the total number of villages per zone. The average 
village size in the floodplain (study area) is about 45 
households, with a range of 15 to 100 households. Within 
villages every second household was chosen randomly from 
household lists established by the village headman. Hence, a 
sample size of 300 households was chosen proportional to 
the size of the village populations, which equates to a 
sampling ratio of 7% of the total population (estimated at 
20,000 by the Ministry of Livestock, Fisheries and Animal 
Industries, MINEPIA). 
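
The sampling figures quoted in this section can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python and not part of the original study. It uses only numbers stated in the text (88 villages, 14 sampled villages, an average of about 45 households per village, 300 sampled households); the variable names are illustrative, and applying the reported average village size to all 88 villages is an assumption made here to approximate the number of households in the study area.

```python
# Figures taken from the text; variable names are illustrative.
n_villages_total = 88            # complete village list (N)
n_villages_sampled = 14          # villages selected across the zones
avg_households_per_village = 45  # reported average village size
n_households_sampled = 300

# Assumption: the reported average applies to every village, which gives
# only an approximate count of households in the study area.
est_households_total = n_villages_total * avg_households_per_village

village_fraction = n_villages_sampled / n_villages_total
household_fraction = n_households_sampled / est_households_total

print(f"Estimated households in study area: {est_households_total}")
print(f"Share of villages sampled: {village_fraction:.1%}")
print(f"Share of households sampled: {household_fraction:.1%}")
```

Under this assumption the household sampling fraction comes out close to the 7% reported in the text.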

All selected villages were visited before commencing the 
household level survey with the aim to establish contacts 
between the researcher and the village headmen and 
conduct focus group discussions (FGDs) with the village 
leaders. The objective of the FGDs was twofold. First, some 
general information was collected such as the village size, 
infrastructure, and access to fish resources and markets. 
Second, complete household lists for every selected village 
were compiled, since no official statistical information 
existed. For this study, a household was defined as an 
economically independent unit consisting of the household 
head, one or more spouse(s), children and other directly 
dependent members, living in the household or having 
migrated to other locations. Household size varies from two 
(i.e., normally husband and spouse) to more than 15. Large 
households are common for Northern Cameroon, since due 
to widespread polygamy household heads often live 
together with up to four wives. Mostly, households do not 
live separately from other kin households, but usually form 
a clan, living together in a larger compound. However, 
within the compound, households are independent from 
each other. During the visits, special attention was paid to 
list the names of individual household heads and not only 
those of the compound/clan leaders. The additional informa- 
tion collected during the FGDs was necessary to get a first 
understanding of the livelihood options and constraints in 
the study area, which proved to be helpful for the devel- 
opment of the household questionnaire. In the last step, the 
compiled household lists were used for a weighted random 
sampling of the 300 sample households. 


3. Survey design 


Seasonality is an important characteristic of the live- 
lihood conditions in the Logone floodplain. Therefore, in 
order to capture seasonal variation, the survey was designed 
to yield a two-period panel data set (2006-2007), with an 
additional third survey six months after conducting the 
baseline survey (see Figure 1). The baseline survey was 
accomplished right at the end of the dry season, when 
income-generating activities are extremely limited, and the 
financial resources, generated during the rainy season in 
2006, are being used up. The period covered in the baseline 
survey was May 2006 to April 2007, constituting a stock 
check of average income flows, consumption expenditures, 
and an asset inventory. The first follow-up survey captured 
the busy time of the year, when expenditures rise due to 
investments (e.g., purchase of new fishing nets and other 
productive assets), and variable production costs in agri- 
culture and fishing. Finally, the second follow-up survey 
covered the second half of the year, giving account of the 
economic household activities in this period. This approach 
was chosen to improve the accuracy of data on livelihood 
activities by reducing the recall period, and to make sure to 
capture seasonal variation in income and consumption. 


[Figure 1: survey timeline showing the baseline survey, the 1st follow-up and the 2nd follow-up, each with its considered period. Source: own illustration] 


Figure 1 Livelihood options in the study area and design of 
the survey 


Before the start of each survey, enumerator training 
workshops of 3 to 4 days were conducted, including pre- 
testing of the questionnaire in order to detect weaknesses 
and the necessity to eliminate, rephrase or add additional 
questions. The baseline pre-test was carried out in two 
villages of zones 1 and 2, in order to test the suitability of the 
questionnaire for different livelihood conditions. The 
baseline study was completed within 3 weeks in May 2007 
by four enumerators, working in a team, and accompanied 
and directly supervised by the first author. This procedure 
gave the opportunity for immediate cross-checking for 
missing information, and also enabled the researcher to 
observe and reinforce interview techniques and immediately 
discuss problems or questions. 

Due to the relative remoteness of the villages and 
difficulties of access, careful logistical planning was neces- 
sary. The field trips often covered several days, and spending 
the nights in the villages was unavoidable. Hence, the 
survey procedure adopted was as follows: the whole team 
arrived in a village, presenting itself to the village chief, who 
had been previously informed about the arrival date of the 
team during the FGD visit. The chief then called the heads 
of the selected households to a central meeting place, 
usually under a tree in front of the chief's house. After the 
interview, which normally took about one hour, the 
respondent was given a small present as a compensation for 
his time (a package of sugar and a bag of tea), and the next 
household head was called to sit down. Working in a group 
enabled the team to finish a village in about one or two days 
and proceed to the next one. That course of action strongly 
motivated and encouraged the enumerators for security and 
psychological reasons. The interview time, and hence the 
time planned to be spent per village, was kept flexible, so 
that careful cross-checking for consistency and plausibility 
of responses was ensured. Hence, during the enumerator 
training workshops and throughout the data collection 
process, special emphasis was placed on the ultimate 
primacy of data quality. 


4. Data collection challenges and lessons learnt 


This section describes some challenges and constraints in 
data collection, which have been encountered during this 
study, but which are not limited to the study region. Similar 
settings are found in many wetlands and floodplains in SSA, 
and the lessons learnt in this study may prove helpful for 
comparable data collection endeavors. 


Seasonality 


When collecting data in rural fisheries-dependent com- 
munities in SSA, the seasonal nature of the livelihood 
systems and the ecological constraints need to be taken into 
consideration. Very often, villages are spatially margin- 
alized and access is extremely difficult during certain 
periods of the year. For example, in the Logone floodplain 
in North Cameroon, access to the villages is very restricted 
during several weeks twice a year due to the annual flood 
cycle. At the beginning of the flooding season, and during 
the deflooding period, access is possible neither by 
vehicle nor by boat. Hence, the timing of the survey 
periods needs to be adapted to these conditions. For example, 
although it would have been more reasonable to place a 
follow-up survey at the end of the production cycle in 
January, thus better capturing agricultural production and 
fishing harvests, this procedure proved to be unfeasible. 
From mid-December to the end of February, access to the 
sampled villages was not possible at all. The research team 
decided on a compromise, collecting data in December, 
even though this falls in the midst of the harvesting season. The 
missing data on yields and income were then collected 
during the second follow-up. Similar problems arise in other 
major inland fisheries such as the Hadejia-Nguru Wetlands 
in Nigeria or the Lower Shire river basin in Malawi. 


Defining time periods 

For recall surveys and particularly for panel surveys (i.e., 
surveys in which the research team repeatedly revisits the same 
households) it is important to assure a common understanding of 
the time period that is considered in the questionnaire. 
Different notions of the time span may result in biased 
information concerning income or consumption flows and 
can flaw the results and conclusions drawn from the study. 
In order to assure a common understanding of the requested 
time period, the respective cultural understanding of time 
needs to be taken into account. We found that in the Logone 
floodplain, people do not think in time units such as weeks 
or months. Hence, questions, such as: “How much did you 
spend on food items in the last 6 months?” were not 
appropriate. In this case, it proved instrumental to refer to 
certain region-wide acknowledged social events or cele- 
brations. For example, the survey in November coincided 
with the Tabaski festivities, so that it was easy for the 
respondents to delimit the time period considered in the 
second follow-up survey. 


Selection of enumerators and their cultural competence 


Perhaps the most important factor in empirical work is 
the choice of the enumerators. To achieve good data quality, 
enumerators must not only have the needed skills and 
knowledge, but also possess additional soft skills, such 
as a command of the local languages, social competence, and the 
willingness to work under severe conditions. 

The lack of sufficiently educated interviewer personnel 
in the Far-North Province in Cameroon presented a serious 
constraint. For this study, a team of five MINEPIA staff, 
who work as government officials in the survey area, was 
recruited as enumerators. While respondents can have 
reservations to provide information to government officers, 
the more important factor was that the survey team 
represented the two ethnic groups of the study area. Also, 
enumerators spoke the languages of the region, they were 
familiar with the local peculiarities, and used to the 
conditions in the field. In addition, respondents’ willingness 
to provide information was actually encouraged by expec- 
tations of follow-up governmental support. 

Another advantage of the selected enumerators was 
awareness and sensitivity towards ethnic tensions. Enumer- 
ators were careful not to take sides with either one of the 
involved parties, and avoided offensive statements. This was 
especially important with regard to multiple visits of 
villages and respondents during the follow-up surveys. Any 
discord between respondents and enumerators would have 
resulted in significant attrition and the need to drop entire 
villages from the sample. 

Certain cultural or religious norms also demanded 
tactfulness and respect. For example, in a number of villages 
only men could be interviewed since women in that 
African-Muslim culture are not allowed to meet or talk to 
men other than direct family members. In cases where the 
household head was not present at the time of the visit, it 
was not possible to interview the spouse (or any other 
woman in the household) instead. An adult male household 
member had to be chosen to provide the required infor- 
mation. For the same reason, interviews could not take place 
in the house of the respondents. For the sake of compliance 
with these cultural norms, the interview procedure had to be 
adapted. Instead of visiting the chosen households one by 
one, all sampled household representatives in each village 
were called to a central meeting place by the village chief 
(usually in front of the chief’s house). If the household head 
was not present, another adult member of the household 
(usually male) was interviewed. The enumerators then 
seated themselves at a distance of about three to five meters 
from each other, calling the respective respondent to be 
interviewed in private, while the others were waiting for 
their turn. 


Sample attrition 


A particular challenge of panel surveys in general is to 
maintain the size of the sample over time (Jackle and Lynn 
2008, Laaksonen 2007). Attrition can be high due to several 
reasons. For example, in some cases the household head has 
died, the whole household has moved away, or the 
respondents lose interest in participating, especially if no or 
insufficient incentives are provided. The loss of willingness to 
participate in a follow-up survey caused a problem during 
the second visit. Due to budget constraints the survey team 
decided not to compensate the participants for their time at 
the second visit. For the baseline survey, each respondent 
had received a box of sugar and a package of tea which 
turned out to be a strong extrinsic incentive. When 
households learned that no remuneration had been foreseen 
at the second visit, 69 households (23% of the total sample) 
announced that they were “too busy” to participate. Consid- 
ering this reaction, compensation was again offered at the 
third survey, so that most of the lost households could be 
regained. They were even willing to respond to both 
questionnaires (1st and 2nd follow-up). Thus the missing data 
could be completed during the last survey round albeit at the 
cost of lower reliability due to memory bias. Such 
respondent behavior is consistent with findings by Jackle 
and Lynn (2008), who report significant positive effects of 
continued incentive payments on attrition, bias and item 
non-response. By the end of the survey period, 14 house- 
holds (4.7%) had been lost due to permanent migration or 
other reasons, and hence were removed from the sample. 


5. Summary and conclusions 


Data collection for poverty analysis in SSA is a chal- 
lenging endeavor. Often, cultural, ecological and economic 
constraints force researchers to accept a compromise 
between data quality and feasibility of the study. On the 
other hand, collection of such data is important because little 
is known about poverty and vulnerability of marginalized 
groups such as fisheries communities in remote areas of 
SSA. In this paper, we present the approach that has been 
taken in the course of a study on poverty and vulnerability 
in the Logone floodplain, which is a major fishing area in 
Northern Cameroon. We identify typical constraints that 
often hamper empirical work in SSA, and show how 
different challenges can be overcome by an adequate survey 
design, sampling and careful application of the survey 
instrument. Major constraints encountered were the diffi- 
culty of accessing the target population, limitations in finding 
qualified enumerators and the high demand for cultural sensi- 
tivity of the research team. 

Of eminent importance is a close collaboration with local 
authorities and experts in the respective field of research, as 
well as a good understanding of and compliance with local 
cultural norms and values. Learning from the local popu- 
lation and empathizing with its particular ways of living 
before starting the survey per se has been found to be a key 
success factor for working in that region. Summing up, it 
can be concluded that despite a number of difficulties, 
quantitative data collection in rural Sub-Saharan Africa is a 
task that can be completed with satisfying results. An 
appropriate survey design and interview procedure devel- 
oped in collaboration with local staff and experts can assure 
adequate data quality for economic poverty and vulner- 
ability analysis. 


Respondent differences and length of data collection 
in the Behavioral Risk Factor Surveillance System 


Mohamed G. Qayad, Pranesh Chowdhury, Shaohua Hu and Lina Balluz ' 


Abstract 


The current economic downturn in the US could challenge costly strategies in survey operations. In the Behavioral Risk 
Factor Surveillance System (BRFSS), ending the monthly data collection at 31 days could be a less costly alternative. 
However, this could potentially exclude a portion of interviews completed after 31 days (late responders) whose respondent 
characteristics could be different in many respects from those who completed the survey within 31 days (early responders). 
We examined whether there are differences between the early and late responders in demographics, health-care coverage, 
general health status, health risk behaviors, and chronic disease conditions or illnesses. We used 2007 BRFSS data, where a 
representative sample of the noninstitutionalized adult U.S. population was selected using a random digit dialing method. 
Late responders were significantly more likely to be male; to report race/ethnicity as Hispanic; to have annual income higher 
than $50,000; to be younger than 45 years of age; to have less than high school education; to have health-care coverage; to 
be significantly more likely to report good health; and to be significantly less likely to report hypertension, diabetes, or 
being obese. The observed differences between early and late responders on survey estimates may hardly influence national 
and state-level estimates. As the proportion of late responders may increase in the future, its impact on surveillance 
estimates should be examined before excluding them from the analysis. Analysis of late responders only should combine several 
years of data to produce reliable estimates. 


Key Words: BRFSS; Responders; Differences; Length of data collection. 


1. Introduction 


The Behavioral Risk Factor Surveillance System 
(BRFSS) is a state-based household telephone survey in the 
United States (U.S.) and its territories which monitors health 
risk behaviors and chronic disease conditions for the adult 
noninstitutionalized population (Centers for Disease Control 
and Prevention [CDC] 2009a, BRFSS Turning Information 
into Public Health, http://www.cdc.gov/brfss/about.htm). It 
is the largest telephone survey in the world and is 
implemented by the 50 states, the District of Columbia, and 
U.S. territories, in collaboration with the CDC. The survey 
is conducted continuously throughout the year. 

CDC dispenses the samples (phone numbers) to states 
quarterly. At the state level, the samples are divided into 12 
monthly lists for operational purposes. Trained interviewers 
call each sampled telephone number. After each call to a 
sampled telephone number, a disposition code is assigned. 
States and their contractors are required to give final 
dispositions to their monthly released samples within that 
month. Over 90% of the monthly samples and completed 
interviews receive final dispositions within 31 days. States 
continue to complete their remaining samples afterwards 
(Qayad, Balluz and Garvin 2009). 

Because of economic downturns, states and survey 
organizations may face budget cuts that could adversely 
affect their survey operations. Such unforeseen circumstances 
warrant searching for alternative operational strategies. A 
cost-effective alternative could be to end data collection at 
the end of each month. However, ending data collection 
within one month excludes interviews completed after 31 
days. Such exclusion could influence the variability of the 
respondents, surveillance estimates and the size of com- 
pleted interviews, which could affect other operational 
decisions. Currently, the size of late responders is small and 
may not influence surveillance estimates. However, the 
current trend in survey responses heralds a continuous 
decline in survey responders, which could prolong the 
time needed to reach respondents and eventually increase the 
proportion of late responders. Such circumstances require 
thorough examination of the influence of late responders on 
surveillance estimates in the future. This study examines 
whether respondents who completed the interviews within 
31 days and those who completed after 31 days are different 
in demographics, risk behaviours, and chronic disease 
conditions. 


2. Methods 


We used data from the 2007 BRFSS, an ongoing 
state-based random digit dialing (RDD) telephone survey 
among the non-institutionalized civilian population in the 
US. We divided the duration of the interview into two 
periods, 0-31 days and >31 days. Respondents who 
completed the interview within 31 days are referred to as early 
responders, and those who completed it after 31 days as 
late responders. 

The demographic factors included were gender, race, 
education, income and age. Race had four groups: White non- 
Hispanic, Black non-Hispanic, Hispanic and other race. 
Education had three levels: not a high school graduate, high 
school graduate, and more than high school education. 
Income categories were <$15,000, $15,000 - $34,999, 
$35,000 - $49,999 and $50,000 or more. Age had the 
following categories: 18-24 years, 25-44 years, 45 - 64 
years, and 65 or more years. Respondents <65 years old 
who did not have any health plan (including health 
insurance, prepaid plans such as HMOs, or government 
plans such as Medicare) were considered not to have health-care 
coverage. General health was dichotomized into good health 
(excellent, very good, or good health) and fair or poor 
health. 

The health risk behaviors included were binge drinking, 
current smoking, (lack of) physical activity, and (insufficient) 
fruit and vegetable consumption. Binge drinking was 
defined as having five or more drinks for men and four or 
more drinks for women on at least one occasion during the 
preceding month. Respondents who had smoked ≥100 cigarettes 
in their lifetime and smoked every day or some days were 
classified as current smokers. Physical activity had the 
following categories: met recommendations for physical 
activity, insufficient physical activity, and no participation 
in physical activity. Respondents who consumed 5 or more 
servings of fruits and vegetables every day were classified as 
meeting the recommendation for fruit and vegetable consumption. 

The chronic conditions or illnesses included were cerebro- 
cardiovascular disease, hypertension, high cholesterol, 
diabetes, asthma, and overweight or obesity. Respondents 
were considered to have myocardial infarction, angina, 
stroke or high blood pressure if they had ever been told by a 
doctor, nurse, or other health professional that they had 
the respective condition. Respondents were classified as having high 
blood cholesterol if they had had their blood cholesterol checked 
and were told by a health professional that their blood 
cholesterol was high. Respondents were classified as having 
diabetes if they had ever been told by a doctor that they had 
diabetes. Asthma status was based on self-report of a diagnosis 
by a physician or health care professional; it had three categories: current 
asthma, former asthma, and never asthma. Self-reported 
weight and height were used to calculate Body Mass Index 
(BMI) (BMI = weight[kg]/(height[m])²). Participants were 
classified as overweight if their BMI was >25 kg/m² and 
were classified as obese if their BMI was >30 kg/m². 
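
The BMI formula and cut-offs described above can be illustrated with a short Python sketch. This is not the authors' SAS/SUDAAN code; the function names are hypothetical. The sketch assumes strict inequalities at the 25 and 30 kg/m² cut-offs, as the text is printed, and treats overweight and obese as mutually exclusive categories, as they appear in Table 3.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def weight_category(bmi_value: float) -> str:
    """Classify a BMI value with the cut-offs quoted in the text (25 and 30 kg/m^2).

    Assumption: strict inequalities, and obese takes precedence over overweight.
    """
    if bmi_value > 30:
        return "obese"
    if bmi_value > 25:
        return "overweight"
    return "not overweight"


# Illustrative respondent: 80 kg and 1.75 m (made-up values, not survey data).
example = bmi(80.0, 1.75)
print(round(example, 1), weight_category(example))
```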

We estimated the percent differences between early and 
late responders by demographics, health behaviors and 
chronic health conditions or illness. We used SUDAAN and 
SAS for the analysis (SAS Institute Inc., Cary, NC, USA 
2004). 


3. Results 


In the 2007 BRFSS survey, there were 430,912 
interviews completed in the U.S. We excluded 14,189 
records from two states (Michigan and Louisiana) and 49 
cases with missing information. We analyzed the remaining 
416,674 respondents of which 394,427 (95%) were early 
responders, and 22,247 (5%) were late responders. We 
estimated weighted and unweighted percent differences 
between early and late responders. The absolute differences 
between the weighted and unweighted percentages in the 
variables examined ranged between 0.06% and 2.6%, 
except white non-Hispanics where the absolute difference 
was 7%. We presented the unweighted analysis for the 
purpose of this study. 
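
The sample accounting in this paragraph can be verified directly. The snippet below is a small Python sketch using only the counts reported in the text; the variable names are illustrative.

```python
# Counts reported in the text; variable names are illustrative.
total_interviews = 430_912
excluded_two_states = 14_189   # Michigan and Louisiana
missing_information = 49
early_responders = 394_427
late_responders = 22_247

analyzed = total_interviews - excluded_two_states - missing_information

# The analyzed sample should equal the sum of the two responder groups.
assert analyzed == early_responders + late_responders

print(f"Analyzed respondents: {analyzed:,}")
print(f"Early responders: {early_responders / analyzed:.1%}")
print(f"Late responders: {late_responders / analyzed:.1%}")
```

The shares round to the 95% and 5% quoted in the text.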

Significant differences were observed between early and 
late responders in demographics, access to health-care cover- 
age, and general health status variables (Table 1). Compared 
to early responders, late responders were significantly more 
likely to be male, to report race/ethnicity as Hispanic, to have 
annual income of >$50,000, to be younger than 45 years of 
age, to have less than high school education, to have access 
to health-care coverage, and to report good health. The 
absolute value of these significant differences in the variables 
above ranged from 1.3% to 7.6%. The percentage of 
Unknowns in the health-care coverage variable was 21% for 
late responders and 30% for early responders. The difference 
between early and late responders remained significant, even 
when we assumed the Unknowns to have a similar 
percentage of access to health-care coverage to those with 
known status in each respondent group. 

A significant difference between early and late re- 
sponders was also observed in health risk behaviors (Table 
2). Compared to early responders, late responders were 
significantly less likely to meet the recommended guidelines 
for physical activity and daily consumption of fruits and 
vegetables. The absolute value of these significant differ- 
ences ranged from 1.7% to 3.1%. The differences between 
early and late responders remained significant even when 
the Unknowns were assumed to have a similar percentage to 
those of known status for both variables. 

Table 3 shows the differences between early and late 
responders in chronic disease conditions or illnesses. Com- 
pared to early responders, late responders were significantly 
more likely to report high cholesterol, significantly less 
likely to report hypertension and diabetes, and were 
significantly less likely to be obese. The absolute value of 
these significant differences ranged from 1.8% to 5.8%. 


Table 1 
Percent differences between early responders and late responders by demographics, health-care coverage and general health, BRFSS 2007 

                                              Length of data collection
                                   Early responders*  Late responders**  Difference
                                   (N = 394,427)      (N = 22,247)       (Early-late)
Demographics                       %                  %                  %             P-Value
Gender
  Female                           62.8               60.2               [illegible]   0.000
  Male                             [illegible]        39.8               -2.5
Race
  White non-Hispanic               79.1               71.5               7.6           0.000
  Black non-Hispanic               7.3                8.2                -0.9          0.168
  Hispanic                         7.1                13.5               -6.4          0.000
  Others                           5.5                5.8                -0.3          0.635
  Unknown                          1.0                1.0                0.0           0.977
Income
  <15,000                          9.7                8.7                1.0           0.146
  15-34,999                        26.1               24.3               1.8           0.004
  35-49,999                        14.1               13.4               0.8           0.252
  50,000+                          36.6               39.7               -3.1          0.000
  Unknown                          13.5               14.0               -0.4          0.496
Age
  18-24                            3.6                4.9                -1.3          0.025
  25-44                            25.7               33.3               -7.6          0.000
  45-64                            40.9               40.6               0.3           0.612
  65+                              29.0               20.2               8.8           0.000
  Unknown                          0.8                1.0                -0.1          0.827
Education Level
  <High School                     10.3               12.3               -2.0          0.001
  High School Graduate             30.6               28.7               1.9           0.001
  >High School                     58.8               58.2               0.6           0.177
  Unknown                          0.3                0.8                -0.5          0.264
Health care coverage (<65 years)
  Yes                              59.3               65.4               -6.2          0.000
  No                               10.8               13.2               -2.5
  Unknown                          30.0               21.4               8.6
Health Status
  Good health                      80.1               81.8               -1.7          0.000
  Fair or poor health              19.4               17.6               1.8
  Unknown                          0.5                0.6                -0.1

*Completed the survey within 31 days. 
**Completed the survey after 31 days. 

Table 2 
Percent differences between early responders and late responders by health risk behaviors, BRFSS 2007 

                                                         Length of data collection
                                              Early responders*  Late responders**  Difference
                                              (N = 394,427)      (N = 22,247)       (Early-late)
Risk factors                                  %                  %                  %             P-Value
Binge drinking
  Yes                                         11.1               11.8               -0.7          0.261
  No                                          86.9               82.8               4.1
  Unknown                                     2.0                5.4                -3.4
Smoking cigarettes
  Current smokers                             18.3               17.4               0.9           0.182
  Not a smoker                                81.3               82.1               -0.8
  Unknown                                     0.4                0.5                0.0
Physical activity recommendations
  Met recommended moderate/vigorous activity  43.4               41.8               1.7           0.000
  Insufficient physical activity              35.4               31.8               3.6
  No physical activity                        14.3               11.3               3.0
  Unknown                                     6.9                15.2               -8.3
Fruit & vegetable consumption
  Consumed ≥ 5 times/day                      25.0               21.9               3.1           0.000
  Consumed < 5 times/day                      73.0               69.7               3.3
  Unknown                                     2.0                8.5                -6.4

*Completed the survey within 31 days. 
**Completed the survey after 31 days. 
Table 3
Percent differences between early responders and late responders by chronic conditions and illnesses, BRFSS 2007

                                    Length of data collection
                                    Early responders*  Late responders**  Difference
                                    (N = 394,427)      (N = 22,247)       (Early-late)
Diseases/chronic conditions         %                  %                  %             P-Value
Cerebral and CVD:
  Myocardial Infarction
    Yes                             5.9                4.9                1.0           0.177
    No                              93.6               94.7               -1.1
    Unknown                         0.5                0.4                0.1
  Angina
    Yes                             6.0                4.5                1.5           0.053
    No                              93.1               94.7               -1.6
    Unknown                         0.9                0.8                0.1
  Stroke
    Yes                             3.8                2.8                1.0           0.183
    No                              95.9               97.0               -1.1
    Unknown                         0.3                0.2                0.1
Other illnesses/conditions:
  High cholesterol
    Yes                             57.0               60.8               -3.8          0.000
    No                              42.3               38.4               3.8
    Unknown                         0.8                0.8                0.0
  Hypertension
    Yes                             35.8               30.1               5.8           0.000
    No                              64.0               69.8               -5.8
    Unknown                         0.2                0.2                0.0
  Diabetes
    Yes                             [illegible]        9.4                1.8           0.010
    Yes-Pregnancy                   0.9                [illegible]        -0.2
    No                              86.4               88.2               -1.9
    Borderline                      1.4                1.2                0.2
    Unknown                         0.1                0.1                0.0
  Asthma
    Current                         8.7                [illegible]        1.0           0.158
    Former                          3.8                4.0                -0.2
    Never                           86.9               87.8               -0.8
    Unknown                         0.6                0.6                0.1
  Overweight or Obese
    Normal weight                   34.5               35.5               -1.1
    Overweight                      35.0               34.7               0.4
    Obese                           26.0               [illegible]        2.4           0.000
    Unknown                         4.5                6.2                -1.7

*Completed the survey within 31 days.
**Completed the survey after 31 days.


4. Discussion 


Our study found significant differences between early 
and late responders in demographic factors, and in some of 
the health risk behaviors and chronic disease conditions or 
illnesses. This shows that the composition of the two groups 
of responders is different with respect to these attributes. 
The differences observed could be due to difficulty in 
reaching persons working long hours and being away from 
their residences. 

The greater likelihood of earning high income, being
Hispanic, being young (18-44 years), having health-care
coverage, having less than high school education, and
reporting good general health among late responders fits the
described characteristics of working people and healthy
workers (Li and Sung 1999; O’Neil 1979). This description
is supported by their significantly lower likelihood of
reporting hypertension, diabetes and obesity. But certain
risk behaviors show a different profile among late
responders. Late responders are less likely to meet
recommended guidelines for moderate or vigorous physical
activity and for daily consumption of fruits and vegetables,
which may be related to late responders having long
working hours and poor access to healthy foods.

The high income earners, who are mostly white
non-Hispanics, and low income earners, who are mostly
Hispanics and black non-Hispanics, may spend long hours
in their working environments and be less likely to be in
their homes to receive survey calls (Voigt, Koepsell and
Daling 2003). In addition, BRFSS data indicate that
interviewers make more calls to late responders, on average
almost 3 times more than to early responders, which bears
out the difficulty of reaching them during the 31-day survey
period. The reasons for working long hours could be
different in the two income groups. Hispanics, black
non-Hispanics, and young age groups may have low-paying
jobs and need to work long hours to make a living, while
the high-income individuals may have jobs requiring them
to remain at work after regular working hours.

Surveillance and epidemiological estimates based only 
on early or late responders should be scrutinized for possible 
biases prior to making any generalizations. The percentage 
of interviews completed after 31 days is currently small 
(5%) and excluding them from the analysis may have no 
influence on national and state level estimates. However, as 
the proportions of late responders are expected to increase in 
the future, the influence of late responders on these 
estimates cannot be ignored (Diehr, Cain, Connell and
Volinn 1990). In addition, states should examine the 
consequences of ending data collection at 31 days on their 
operations, performance indicators, data quality measures, 
cost-savings and other contractual agreements with their 
data collection contractors. 

Our study has a few limitations. BRFSS uses RDD 
methodology to select telephone numbers, which is subject 
to coverage bias (Rao, Link, Battaglia, Frankel, Giambo, 
and Mokdad 2005; Frankel, Srinath, Hoaglin, Battaglia, 
Smith, Wright and Khare 2003). Information collected is 
self-reported and may be subject to recall bias in some risk 
behaviors and disease estimations (Troiano, Berrigan, Dodd, 
Masse, Tilert and McDowell 2008; CDC 2004). In addition, 
we excluded two states from our analysis (Michigan and 
Louisiana), and extrapolation of the findings to these states 
should be done cautiously. 

Despite these limitations, this study shows that late 
responders are significantly different in many respects from 
early responders. As the proportion of late responders may 
increase in the future, the influence of late responders on 
surveillance estimates should be examined carefully. 
An interesting property of the entropy of some sampling designs


Yves Tillé and David Haziza


Abstract 


In this short note, we show that simple random sampling without replacement and Bernoulli sampling have approximately 
the same entropy when the population size is large. An empirical example is given as an illustration. 


Key Words: Conditional Poisson sampling; Entropy; Simple random sampling; Poisson sampling. 


1. Introduction 


Consider a finite population of size $N$ and let
$U = \{1, ..., k, ..., N\}$ be the set of labels of this
population. A sample $s$ is a subset of $U$ and a sampling
design is a probability law $p(.)$ on the subsets of $U$ such
that $p(s) \geq 0$ for all $s \subset U$, and

$$\sum_{s \subset U} p(s) = 1.$$

Let $\pi_k = P(k \in s)$ be the first-order inclusion probability
of unit $k$ in the sample:

$$\pi_k = \sum_{s \subset U,\; s \ni k} p(s).$$

Similarly, let $\pi_{k\ell} = P(k \in s \text{ and } \ell \in s)$ be
the second-order inclusion probability of units $k$ and $\ell$ in
the sample:

$$\pi_{k\ell} = \sum_{s \subset U,\; s \ni k, \ell} p(s).$$

The entropy of a sampling design $p(.)$, denoted by
$I(p)$, is defined as

$$I(p) = -\sum_{s \in Q} p(s) \log p(s), \qquad (1)$$


where $Q = \{s \mid p(s) > 0\}$ is the support of the sampling
design $p(.)$. A sampling design has high entropy when
there is a high amount of uncertainty, or surprise, in the
sample that will be selected. In other words, when a
sampling design has high entropy, it is very difficult to
predict the type of sample we would obtain. Many sampling
designs used in practice are high entropy designs. One
notable exception is systematic sampling, which has a very
low entropy. The concept of entropy is useful in the context
of variance estimation. When a sampling design has a high
entropy, it is possible to approximate the second-order
inclusion probabilities, $\pi_{k\ell}$, in terms of the
first-order inclusion probabilities, which considerably
simplifies the problem of variance estimation in the context
of unequal probability sampling; see, e.g., Brewer and
Donadio (2003), Matei and Tillé (2005), Henderson (2006)
and Haziza, Mecatti and Rao (2008).
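
As a minimal sketch of the definitions above (not taken from the paper), the following Python snippet stores a design as a dictionary that maps each sample to its probability, and evaluates the inclusion probabilities and the entropy (1) by direct enumeration. The toy design, which gives equal probability to every sample of size 2 from a population of size 4, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import log


def entropy(design):
    """Entropy -sum p(s) log p(s), taken over samples with p(s) > 0."""
    return -sum(p * log(p) for p in design.values() if p > 0)


def first_order(design, k):
    """Sum of p(s) over all samples s that contain unit k."""
    return sum(p for s, p in design.items() if k in s)


def second_order(design, k, l):
    """Sum of p(s) over all samples s that contain both k and l."""
    return sum(p for s, p in design.items() if k in s and l in s)


# Toy design (an assumption for illustration): every 2-subset of
# U = {1, 2, 3, 4} is selected with the same probability 1/6.
U = range(1, 5)
samples = [frozenset(c) for c in combinations(U, 2)]
design = {s: 1 / len(samples) for s in samples}

print(first_order(design, 1))       # inclusion probability of unit 1
print(second_order(design, 1, 2))   # joint inclusion probability of units 1 and 2
print(entropy(design), log(6))      # the two values should agree
```

Enumeration is only feasible for very small populations, since the number of subsets of $U$ grows as $2^N$.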

It is well known that, for given first-order inclusion
probabilities $\pi_k$, the sampling design with maximum
entropy is Poisson sampling:

$$p_{poiss}(s) = \prod_{k \in s} \pi_k \prod_{k \in U \setminus s} (1 - \pi_k), \qquad (2)$$

for all $s \in Q$; e.g., Tillé (2006). A special case of Poisson
sampling is Bernoulli sampling, which is obtained from (2)
by setting $\pi_k = \pi \in (0, 1)$, which leads to

$$p_{bern}(s) = \pi^{n_s} (1 - \pi)^{N - n_s}, \quad \text{for all } s \in Q,$$

where $n_s$ is the random size of $s$. Using (1) and noting that
$\sum_{s \in Q} n_s \, p(s) = N\pi$, the entropy of Bernoulli
sampling is given by

$$I(p_{bern}) = -N(1 - \pi) \log(1 - \pi) - N\pi \log \pi, \qquad (3)$$

which is maximum when $\pi = 1/2$. In this case, we have
$I(p_{bern}) = N \log 2$.
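
The closed form (3) can be checked numerically against the definition (1) for a small population. The sketch below is illustrative only; the function names and the choice $N = 8$ are assumptions, and the enumeration visits all $2^N$ subsets, so it is practical only for small $N$.

```python
from itertools import combinations
from math import log


def bernoulli_entropy(N, pi):
    """Closed form (3) for the entropy of Bernoulli sampling."""
    return -N * (1 - pi) * log(1 - pi) - N * pi * log(pi)


def bernoulli_entropy_enumerated(N, pi):
    """Entropy (1) summed over all 2**N subsets of the population."""
    total = 0.0
    for size in range(N + 1):
        # Every sample of this size has the same Bernoulli probability.
        p = pi ** size * (1 - pi) ** (N - size)
        for _ in combinations(range(N), size):
            total -= p * log(p)
    return total


N = 8
for pi in (0.1, 0.3, 0.5):
    print(pi, bernoulli_entropy(N, pi), bernoulli_entropy_enumerated(N, pi))

# At pi = 1/2 the entropy should equal N log 2.
print(bernoulli_entropy(N, 0.5), N * log(2))
```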

If we restrict to the class of fixed size sampling designs
with first-order inclusion probabilities $\pi_k$, $k \in U$, the
maximum entropy design is the so-called Conditional Poisson
Sampling (CPS); see Chen, Dempster and Liu (1994), Deville
(2000) and Tillé (2006). The CPS design can be implemented
by repeatedly selecting samples according to Poisson
sampling until the desired sample size, $n$ (say), has been
obtained. When $\pi_k = n/N$ for all $k \in U$, the CPS design
reduces to simple random sampling without replacement:

$$p_{srs}(s) = \binom{N}{n}^{-1}$$

for all $s \in Q$. From (1), it follows that the entropy of simple
random sampling is given by

$$I(p_{srs}) = \log N! - \log n! - \log (N - n)!. \qquad (4)$$

In other words, simple random sampling without replacement
is the maximum entropy design in the class of equal
probability fixed size sampling designs.
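
The rejective implementation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names, the seed and the toy values $N = 5$, $n = 2$ are hypothetical, and only the equal-probability case $\pi_k = n/N$ is exercised, for which the accepted samples should behave like simple random samples without replacement.

```python
import random


def poisson_sample(pi, rng):
    """One Poisson sample: unit k enters independently with probability pi[k]."""
    return frozenset(k for k, p in enumerate(pi) if rng.random() < p)


def rejective_sample(pi, n, rng):
    """Redraw Poisson samples until one of size exactly n is obtained."""
    while True:
        s = poisson_sample(pi, rng)
        if len(s) == n:
            return s


rng = random.Random(2010)      # arbitrary seed, for reproducibility only
N, n = 5, 2
pi = [n / N] * N               # equal inclusion probabilities

draws = [rejective_sample(pi, n, rng) for _ in range(20000)]

# Empirical inclusion frequency of each unit; each should be close to n/N.
freq = [sum(k in s for s in draws) / len(draws) for k in range(N)]
print(freq)
print(len(set(draws)))         # number of distinct samples observed
```

The number of rejections grows as the probability of hitting size $n$ shrinks, so this direct approach is mainly useful as a check on small examples.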
Not all sampling designs possess a high entropy. For
example, the 1-in-$G$ systematic sampling design has a very
low entropy. Here, the number of samples, $G = N/n$, is
assumed to be an integer value. Since $p_{sys}(s) = 1/G$ for all
$s \in Q$, the entropy of systematic sampling is given by

$$I(p_{sys}) = \log N - \log n,$$

which is much smaller than (4), especially for large values
of $N$.
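
To make the comparison concrete, the sketch below lists the $G$ possible systematic samples for $N = 100$ and $n = 20$, computes their entropy from the definition (1), and sets it beside the simple random sampling entropy (4). The function names are assumptions for illustration, and $\log k!$ is evaluated with `math.lgamma(k + 1)`.

```python
from math import lgamma, log


def systematic_samples(N, n):
    """The G = N/n possible 1-in-G systematic samples of U = {1, ..., N}."""
    if N % n != 0:
        raise ValueError("N/n must be an integer")
    G = N // n
    return [frozenset(range(r, N + 1, G)) for r in range(1, G + 1)]


def entropy_equal_probability(num_samples):
    """Entropy (1) when each sample in the support has probability 1/num_samples."""
    p = 1 / num_samples
    return -num_samples * p * log(p)


def srs_entropy(N, n):
    """Expression (4): log N! - log n! - log (N - n)!."""
    return lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)


N, n = 100, 20
samples = systematic_samples(N, n)
print(len(samples))                                   # number of samples, G
print(entropy_equal_probability(len(samples)))        # entropy from (1)
print(log(N) - log(n))                                # closed form, log N - log n
print(srs_entropy(N, n))                              # entropy (4), for comparison
```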


2. Main result 


In this section, we compare the entropy of Bernoulli
sampling with that of simple random sampling without
replacement. Since the support of the Bernoulli sampling
design is much larger than that of simple random sampling
without replacement, we expected the entropy of Bernoulli
sampling to be much larger than that of simple random
sampling without replacement. Table 1 shows the entropy
for simple random sampling and Bernoulli sampling for
different values of $N$ and $\pi$. Surprisingly, we found the
entropy of both sampling designs for the same inclusion
probabilities and the same (expected) sample size to be
approximately equal. From Table 1, it is clear that both
sampling designs have similar entropies, even for moderate
population sizes (e.g., $N = 100$), independently of the value
of $\pi$. This result is somewhat curious considering the
strong reduction in the number of possible samples caused
by fixing the sample size. Indeed, recall that the size of the
support is $\binom{N}{n}$ for simple random sampling without
replacement, whereas it is $2^N$ for Bernoulli sampling. For
example, for $N = 100$ and $n = 20$, the size of the support
for simple random sampling without replacement is equal to
$\binom{100}{20} \approx 5.36 \times 10^{20}$, whereas it is equal
to $2^{100} \approx 1.27 \times 10^{30}$ for Bernoulli sampling. In
other words, the size of the support of Bernoulli sampling is
approximately $2.37 \times 10^{9}$ times larger than that of
simple random sampling without replacement.
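
The support sizes in this example can be recomputed exactly with Python's arbitrary-precision integers. The snippet is a small check rather than part of the paper; the variable names are assumptions.

```python
from math import comb

N, n = 100, 20

srs_support = comb(N, n)    # number of samples of fixed size n
bern_support = 2 ** N       # number of subsets of the population

print(f"{srs_support:.3e}")                   # support size, simple random sampling
print(f"{bern_support:.3e}")                  # support size, Bernoulli sampling
print(f"{bern_support / srs_support:.3e}")    # ratio of the two support sizes
```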


Result 1. Let $I(p_{bern})$ and $I(p_{srs})$ be the entropies of
Bernoulli sampling and simple random sampling without
replacement, given by (3) and (4), respectively. Then,

$$\lim_{N \to \infty} \frac{I(p_{srs})}{I(p_{bern})} = 1.$$

Proof. By Stirling’s formula (see Abramowitz
and Stegun 1964, page 257),

$$\lim_{n \to \infty} \frac{n \log n - n}{\log n!} = 1,$$

we get

$$\lim_{n \to \infty,\; N - n \to \infty} \frac{N \log N - n \log n - (N - n) \log (N - n)}{\log \binom{N}{n}} = 1,$$

from which, writing $n = N\pi$, we obtain

$$\lim_{N \to \infty} \frac{\log \binom{N}{n}}{-N(1 - \pi) \log(1 - \pi) - N\pi \log \pi} = 1.$$
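
The algebra behind the last step can be spelled out as follows. This is a sketch of the omitted intermediate steps, assuming the sampling fraction $\pi = n/N$ is held fixed as $N$ grows; the symbol $\approx$ means that the ratio of the two sides tends to 1.

```latex
\begin{align*}
\log \binom{N}{n}
  &= \log N! - \log n! - \log (N - n)! \\
  &\approx (N \log N - N) - (n \log n - n) - \{(N - n) \log (N - n) - (N - n)\} \\
  &= N \log N - n \log n - (N - n) \log (N - n) \\
  &= N \log N - N\pi \log (N\pi) - N(1 - \pi) \log \{N(1 - \pi)\} \\
  &= -N\pi \log \pi - N(1 - \pi) \log (1 - \pi).
\end{align*}
```

The terms in $\log N$ cancel because $N\pi + N(1 - \pi) = N$, which leaves exactly the Bernoulli entropy (3).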


3. Conclusion 


In this note, we showed that Bernoulli sampling and
simple random sampling without replacement have very
similar entropies, even for moderate population sizes. We
conjecture that the same should be observed when comparing
the Poisson sampling design and the CPS design for
a given set of first-order inclusion probabilities. However,
the proof of this result seems to be considerably more
complex.


Table 1
Entropy of (Bernoulli sampling, simple random sampling) designs

N          $\pi = 0.1$             $\pi = 0.2$             $\pi = 0.3$             $\pi = 0.4$             $\pi = 0.5$
10         (3.3, 2.3)              (5, 3.8)                (6.1, 4.8)              (6.7, 5.3)              (6.9, 5.5)
100        (32.5, 30.5)            (50, 47.7)              (61.1, 58.6)            (67.3, 64.8)            (69.3, 66.8)
1,000      (325.1, 321.9)          (500.4, 496.9)          (610.9, 607.3)          (673, 669.4)            (693.1, 689.5)
10,000     (3,250.8, 3,246.5)      (5,004, 4,999.4)        (6,108.6, 6,103.9)      (6,730.1, 6,725.3)      (6,931.5, 6,926.6)
100,000    (32,508.3, 32,502.8)    (50,040.2, 50,034.5)    (61,086.4, 61,080.5)    (67,301.2, 67,295.2)    (69,314.7, 69,308.7)
1,000,000  (325,083, 325,076)      (500,402, 500,396)      (610,864, 610,857)      (673,012, 673,005)      (693,147, 693,140)
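
The entries of Table 1 follow directly from (3) and (4), so they can be regenerated with a short script. The sketch below is illustrative: the function names are assumptions, $n$ is taken as $N\pi$ rounded to the nearest integer, and $\log k!$ is evaluated with `math.lgamma(k + 1)` so that large $N$ poses no difficulty. Printed values may differ from the table in the last digit because of rounding.

```python
from math import lgamma, log


def bernoulli_entropy(N, pi):
    """Expression (3)."""
    return -N * (1 - pi) * log(1 - pi) - N * pi * log(pi)


def srs_entropy(N, n):
    """Expression (4), with log k! computed as lgamma(k + 1)."""
    return lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)


def table_row(N, pis=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Pairs (Bernoulli entropy, SRS entropy) for each sampling fraction."""
    return [(bernoulli_entropy(N, pi), srs_entropy(N, round(N * pi)))
            for pi in pis]


for N in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    cells = "  ".join(f"({b:,.1f}, {s:,.1f})" for b, s in table_row(N))
    print(f"{N:>9,}  {cells}")
```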
ANNOUNCEMENTS


Nominations Sought for the 2012 Waksberg Award 


The journal Survey Methodology has established an annual invited paper series in honour of 
Joseph Waksberg to recognize his contributions to survey methodology. Each year a prominent survey 
statistician is chosen to write a paper that reviews the development and current state of an important topic in 
the field of survey methodology. The paper reflects the mixture of theory and practice that characterized 
Joseph Waksberg’s work. 


The recipient of the Waksberg Award will receive an honorarium from Westat. The paper will be 
published in a future issue of Survey Methodology. 


The author of the 2012 Waksberg paper will be selected by a four-person committee appointed by Survey 
Methodology and the American Statistical Association. Nomination of individuals to be considered as 
authors or suggestions for topics should be sent before February 28, 2011 to the chair of the committee, 


Elizabeth Martin (betsy@folhc.org). 
J.N.K. Rao, “Interplay between sample survey theory and practice: An appraisal”. Survey
Alastair Scott, “Population-based case control studies”. Survey Methodology, vol. 32, 2,
123-132. 

Carl-Erik Sarndal, “The calibration approach in survey theory and practice”. Survey 
Methodology, vol. 33, 2, 99-119. 

Mary E. Thompson, “International surveys: Motives and methodologies”. Survey Methodology, 
vol. 34, 2, 131-141. 

Graham Kalton, “Methods for oversampling rare subpopulations in social surveys”. Survey 
Methodology, vol. 35, 2, 125-141. 

Ivan P. Fellegi, “The organisation of statistical methodology and methodological research in 
national statistical offices”. Survey Methodology, vol. 36, 2, 123-130. 